* [PATCH 01/14] migration: Fix low possibility downtime violation
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-08 16:55 ` [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
` (14 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson, qemu-stable
When QEMU queries the estimated amount of pending data and thinks it's
ready to converge, it sends another accurate query to make sure of it.
This is needed so that we collect the latest reports and the equation
still holds true.

However, we missed one tiny difference between "<" and "<=" when
comparing pending_size (A) to threshold_size (B): the QEMU source only
re-queries if A < B, but kicks off switchover if A <= B.

This means that if A (so far only an estimate) accidentally equals B,
the re-query won't happen and switchover will proceed without
considering newly dirtied data.

It turns out this was an accident in my commit 7aaa1fc072 when
refactoring the surrounding code. Fix it by using the same comparison
in both places.
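As an illustration only (not the actual QEMU code, and with hypothetical
names), the convergence decision can be modeled like this; the bug was
using "<" on the re-query line while the final check used "<=":

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative model of the convergence check, hypothetical names.
 * 'estimate' is the cheap pending-size query, 'exact' the accurate one.
 * The fix: the re-query condition must use "<=", matching the final
 * switchover check, so estimate == threshold still triggers the
 * accurate query before deciding to switch over.
 */
static bool should_switchover(uint64_t estimate, uint64_t exact,
                              uint64_t threshold)
{
    uint64_t pending = estimate;

    if (pending <= threshold) {   /* was "<" before the fix */
        pending = exact;          /* refresh with the accurate figure */
    }
    return pending <= threshold;
}
```

With the old "<", the case estimate == threshold would have skipped the
accurate query and switched over based on a stale estimate.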
Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
Cc: qemu-stable@nongnu.org
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58..dfc60372cf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* postcopy started, so ESTIMATE should always match with EXACT
* during postcopy phase.
*/
- if (pending_size < s->threshold_size) {
+ if (pending_size <= s->threshold_size) {
qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
pending_size = must_precopy + can_postcopy;
trace_migrate_pending_exact(pending_size, must_precopy,
--
2.53.0
* [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-04-08 16:55 ` [PATCH 01/14] migration: Fix low possibility downtime violation Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:08 ` Juraj Marcin
2026-04-10 11:10 ` Michal Prívozník
2026-04-08 16:55 ` [PATCH 03/14] vfio/migration: Cache stop size in VFIOMigration Peter Xu
` (13 subsequent siblings)
15 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson, devel
These stats are only about RAM; make the name say so. This paves the
way for adding statistics for all devices.

Thanks to Markus, who pointed out that docs/devel/qapi-code-gen.rst has
a section "Compatibility considerations" which states:

    Since type names are not visible in the Client JSON Protocol, types
    may be freely renamed. Even certain refactorings are invisible, such
    as splitting members from one type into a common base type.

Hence this change is not an ABI violation according to the document.

While at it, touch up the lines to read better and correct the claim
that the "ram" section is only reported when migration status is
'active' or 'completed': over time we have grown many new statuses that
also report the "ram" section.
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: devel@lists.libvirt.org
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
docs/about/removed-features.rst | 2 +-
qapi/migration.json | 10 +++++-----
migration/migration-stats.h | 2 +-
3 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index 557a24679a..ccd49b5615 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -699,7 +699,7 @@ was superseded by ``sections``.
``query-migrate`` return value member ``skipped`` (removed in 9.1)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
+Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
for more than 10 years. Removed with no replacement.
``migrate`` command option ``inc`` (removed in 9.1)
diff --git a/qapi/migration.json b/qapi/migration.json
index 7134d4ce47..e3ad3f0604 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -12,7 +12,7 @@
{ 'include': 'sockets.json' }
##
-# @MigrationStats:
+# @MigrationRAMStats:
#
# Detailed migration status.
#
@@ -64,7 +64,7 @@
#
# Since: 0.14
##
-{ 'struct': 'MigrationStats',
+{ 'struct': 'MigrationRAMStats',
'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
'duplicate': 'int',
'normal': 'int',
@@ -209,8 +209,8 @@
# If this field is not returned, no migration process has been
# initiated
#
-# @ram: `MigrationStats` containing detailed migration status, only
-# returned if status is 'active' or 'completed'(since 1.2)
+# @ram: Detailed migration RAM statistics, only returned if migration
+# is in progress or completed (since 1.2)
#
# @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
# migration statistics, only returned if XBZRLE feature is on and
@@ -309,7 +309,7 @@
# Since: 0.14
##
{ 'struct': 'MigrationInfo',
- 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
+ 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
'*vfio': 'VfioStats',
'*xbzrle-cache': 'XBZRLECacheStats',
'*total-time': 'int',
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index c0f50144c9..1153520f7a 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -27,7 +27,7 @@
/*
* These are the ram migration statistic counters. It is loosely
- * based on MigrationStats.
+ * based on MigrationRAMStats.
*/
typedef struct {
/*
--
2.53.0
* Re: [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-04-08 16:55 ` [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
@ 2026-04-09 17:08 ` Juraj Marcin
2026-04-10 11:10 ` Michal Prívozník
1 sibling, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:08 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, devel
On 2026-04-08 12:55, Peter Xu wrote:
> These stats are only about RAM; make the name say so. This paves the
> way for adding statistics for all devices.
>
> Thanks to Markus, who pointed out that docs/devel/qapi-code-gen.rst has
> a section "Compatibility considerations" which states:
>
>     Since type names are not visible in the Client JSON Protocol, types
>     may be freely renamed. Even certain refactorings are invisible, such
>     as splitting members from one type into a common base type.
>
> Hence this change is not an ABI violation according to the document.
>
> While at it, touch up the lines to read better and correct the claim
> that the "ram" section is only reported when migration status is
> 'active' or 'completed': over time we have grown many new statuses that
> also report the "ram" section.
>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: devel@lists.libvirt.org
> Reviewed-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/about/removed-features.rst | 2 +-
> qapi/migration.json | 10 +++++-----
> migration/migration-stats.h | 2 +-
> 3 files changed, 7 insertions(+), 7 deletions(-)
>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
* Re: [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-04-08 16:55 ` [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
2026-04-09 17:08 ` Juraj Marcin
@ 2026-04-10 11:10 ` Michal Prívozník
2026-04-15 16:09 ` Peter Xu
1 sibling, 1 reply; 52+ messages in thread
From: Michal Prívozník @ 2026-04-10 11:10 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Zhiyi Guo, Juraj Marcin, Prasad Pandit,
Avihai Horon, Kirti Wankhede, Cédric Le Goater,
Fabiano Rosas, Joao Martins, Markus Armbruster, Alex Williamson,
devel
On 4/8/26 18:55, Peter Xu via Devel wrote:
> These stats are only about RAM; make the name say so. This paves the
> way for adding statistics for all devices.
>
> Thanks to Markus, who pointed out that docs/devel/qapi-code-gen.rst has
> a section "Compatibility considerations" which states:
>
>     Since type names are not visible in the Client JSON Protocol, types
>     may be freely renamed. Even certain refactorings are invisible, such
>     as splitting members from one type into a common base type.
>
> Hence this change is not an ABI violation according to the document.
>
> While at it, touch up the lines to read better and correct the claim
> that the "ram" section is only reported when migration status is
> 'active' or 'completed': over time we have grown many new statuses that
> also report the "ram" section.
>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: devel@lists.libvirt.org
> Reviewed-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/about/removed-features.rst | 2 +-
> qapi/migration.json | 10 +++++-----
> migration/migration-stats.h | 2 +-
> 3 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
> index 557a24679a..ccd49b5615 100644
> --- a/docs/about/removed-features.rst
> +++ b/docs/about/removed-features.rst
> @@ -699,7 +699,7 @@ was superseded by ``sections``.
> ``query-migrate`` return value member ``skipped`` (removed in 9.1)
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
>
> -Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
> +Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
> for more than 10 years. Removed with no replacement.
>
> ``migrate`` command option ``inc`` (removed in 9.1)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 7134d4ce47..e3ad3f0604 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -12,7 +12,7 @@
> { 'include': 'sockets.json' }
>
> ##
> -# @MigrationStats:
> +# @MigrationRAMStats:
> #
> # Detailed migration status.
> #
> @@ -64,7 +64,7 @@
> #
> # Since: 0.14
> ##
> -{ 'struct': 'MigrationStats',
> +{ 'struct': 'MigrationRAMStats',
> 'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
> 'duplicate': 'int',
> 'normal': 'int',
Libvirt does not really store or care about struct names (only
temporarily, when 'flattening' the output of query-qmp-schema to see
whether a command has a certain argument, for instance). So from
libvirt's POV:
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Michal
* Re: [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-04-10 11:10 ` Michal Prívozník
@ 2026-04-15 16:09 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-15 16:09 UTC (permalink / raw)
To: Michal Prívozník
Cc: qemu-devel, Maciej S . Szmigiero, Zhiyi Guo, Juraj Marcin,
Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, devel
On Fri, Apr 10, 2026 at 01:10:51PM +0200, Michal Prívozník wrote:
> Libvirt does not store/care about struct names really (only temporarily
> when 'flattening' output of query-qmp-schema to see if a command has
> certain argument, for instance). So from libvirt's POV:
>
> Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Thanks for double checking!
--
Peter Xu
* [PATCH 03/14] vfio/migration: Cache stop size in VFIOMigration
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-04-08 16:55 ` [PATCH 01/14] migration: Fix low possibility downtime violation Peter Xu
2026-04-08 16:55 ` [PATCH 02/14] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-13 9:52 ` Avihai Horon
2026-04-08 16:55 ` [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
` (12 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
Add a field to cache the stop-copy size. Note that this changes the
initial default value of the stop size in vfio_save_setup(), but it
shouldn't matter since the value is immediately clamped with MIN()
against VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE.

Document that all three sizes we read from VFIO's uAPI (the dirty and
stop-copy sizes) are estimates, so QEMU must always remember they can
be anything.
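As a sketch under assumed names (not the actual QEMU code), the cached
stop-copy estimate can be decremented with saturation, since the uAPI
values are estimates only and more bytes may be read() than were
reported:

```c
#include <stdint.h>

/*
 * Hypothetical helper: saturating decrement of a cached size estimate.
 * The counter must never underflow, even if we end up read()ing more
 * bytes than the (estimated) size the device reported.
 */
static void account_transferred(uint64_t *stopcopy_size, uint64_t data_size)
{
    if (*stopcopy_size > data_size) {
        *stopcopy_size -= data_size;
    } else {
        *stopcopy_size = 0;
    }
}
```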
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/vfio/vfio-migration-internal.h | 8 +++++
hw/vfio/migration.c | 50 ++++++++++++++++++-------------
2 files changed, 38 insertions(+), 20 deletions(-)
diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index 814fbd9eba..a15fc74703 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -45,8 +45,16 @@ typedef struct VFIOMigration {
void *data_buffer;
size_t data_buffer_size;
uint64_t mig_flags;
+ /*
+ * NOTE: all three sizes cached are reported from VFIO's uAPI, which
+ * are defined as estimate only. QEMU should not trust these values
+ * but only use them to do best-effort estimates. Always be prepared
+ * that these sizes may either grow or even shrink in reality while
+ * read()ing from the VFIO fds.
+ */
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
+ uint64_t stopcopy_size;
bool multifd_transfer;
VFIOMultifd *multifd;
bool initial_data_sent;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 83327b6573..5d5fca09bd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -41,6 +41,12 @@
*/
#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+/*
+ * Migration size of VFIO devices can be as little as a few KBs or as big as
+ * many GBs. This value should be big enough to cover the worst case.
+ */
+#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
+
static unsigned long bytes_transferred;
static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -314,8 +320,7 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
migration->data_fd = -1;
}
-static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
- uint64_t *stop_copy_size)
+static int vfio_query_stop_copy_size(VFIODevice *vbasedev)
{
uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
sizeof(struct vfio_device_feature_mig_data_size),
@@ -323,16 +328,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
struct vfio_device_feature_mig_data_size *mig_data_size =
(struct vfio_device_feature_mig_data_size *)feature->data;
+ VFIOMigration *migration = vbasedev->migration;
feature->argsz = sizeof(buf);
feature->flags =
VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+ /*
+ * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE
+ * is reported so downtime limit won't be violated.
+ */
+ migration->stopcopy_size = VFIO_MIG_STOP_COPY_SIZE;
return -errno;
}
- *stop_copy_size = mig_data_size->stop_copy_length;
+ migration->stopcopy_size = mig_data_size->stop_copy_length;
return 0;
}
@@ -409,6 +420,16 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
return;
}
+ /*
+ * The total remaining size requires separate accounting. Do not trust
+ * the counter: what we have read() may be more than what was reported.
+ */
+ if (migration->stopcopy_size > data_size) {
+ migration->stopcopy_size -= data_size;
+ } else {
+ migration->stopcopy_size = 0;
+ }
+
if (migration->precopy_init_size) {
uint64_t init_size = MIN(migration->precopy_init_size, data_size);
@@ -463,7 +484,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
int ret;
if (!vfio_multifd_setup(vbasedev, false, errp)) {
@@ -472,9 +492,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
- vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
+ vfio_query_stop_copy_size(vbasedev);
migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
- stop_copy_size);
+ migration->stopcopy_size);
migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
if (!migration->data_buffer) {
error_setg(errp, "%s: Failed to allocate migration data buffer",
@@ -570,32 +590,22 @@ static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
migration->precopy_dirty_size);
}
-/*
- * Migration size of VFIO devices can be as little as a few KBs or as big as
- * many GBs. This value should be big enough to cover the worst case.
- */
-#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
-
static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
uint64_t *can_postcopy)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
- /*
- * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
- * reported so downtime limit won't be violated.
- */
- vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
- *must_precopy += stop_copy_size;
+ vfio_query_stop_copy_size(vbasedev);
+ *must_precopy += migration->stopcopy_size;
if (vfio_device_state_is_precopy(vbasedev)) {
vfio_query_precopy_size(migration);
}
trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
- stop_copy_size, migration->precopy_init_size,
+ migration->stopcopy_size,
+ migration->precopy_init_size,
migration->precopy_dirty_size);
}
--
2.53.0
* Re: [PATCH 03/14] vfio/migration: Cache stop size in VFIOMigration
2026-04-08 16:55 ` [PATCH 03/14] vfio/migration: Cache stop size in VFIOMigration Peter Xu
@ 2026-04-13 9:52 ` Avihai Horon
0 siblings, 0 replies; 52+ messages in thread
From: Avihai Horon @ 2026-04-13 9:52 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 4/8/2026 7:55 PM, Peter Xu wrote:
>
> Add a field to cache the stop-copy size. Note that this changes the
> initial default value of the stop size in vfio_save_setup(), but it
> shouldn't matter since the value is immediately clamped with MIN()
> against VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE.
>
> Document that all three sizes we read from VFIO's uAPI (the dirty and
> stop-copy sizes) are estimates, so QEMU must always remember they can
> be anything.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/vfio/vfio-migration-internal.h | 8 +++++
> hw/vfio/migration.c | 50 ++++++++++++++++++-------------
> 2 files changed, 38 insertions(+), 20 deletions(-)
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
* [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (2 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 03/14] vfio/migration: Cache stop size in VFIOMigration Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:10 ` Juraj Marcin
` (2 more replies)
2026-04-08 16:55 ` [PATCH 05/14] migration: Use the new save_query_pending() API directly Peter Xu
` (11 subsequent siblings)
15 siblings, 3 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
These two APIs are largely duplicated. For example, a few users
directly pass the same function for both hooks.

Providing two hooks is also error prone: it makes it easy for a module
to report different things via the two hooks. In reality, they should
always report the same thing; the only difference is whether to take a
fast path when the slow path might be too slow, as QEMU may query this
information quite frequently during the migration process.

Merge them into one API, with a bool showing whether the query is an
exact one. No functional change intended.

Export qemu_savevm_query_pending(), so that new users can do the query
with the new API. This will happen very soon.
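A minimal mock (hypothetical names, beyond the types the patch itself
introduces) of how a single callback with an 'exact' flag replaces the
two separate hooks:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the patch's MigPendingData shape, for illustration only. */
typedef struct {
    uint64_t precopy_bytes;
    uint64_t postcopy_bytes;
} PendingData;

/*
 * One hook serves both the cheap estimate and the accurate query; the
 * caller selects the path with 'exact' instead of choosing between two
 * function pointers that could accidentally diverge.
 */
static void mock_query_pending(void *opaque, PendingData *pending, bool exact)
{
    const uint64_t *sizes = opaque;  /* sizes[0] = estimate, sizes[1] = exact */

    pending->precopy_bytes += exact ? sizes[1] : sizes[0];
}
```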
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Jason Herne <jjherne@linux.ibm.com>
Cc: Eric Farman <farman@linux.ibm.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Cc: John Snow <jsnow@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
docs/devel/migration/main.rst | 9 ++----
docs/devel/migration/vfio.rst | 9 ++----
include/migration/register.h | 52 +++++++++++-----------------------
migration/savevm.h | 3 ++
hw/s390x/s390-stattrib.c | 9 +++---
hw/vfio/migration.c | 48 ++++++++++++++-----------------
migration/block-dirty-bitmap.c | 10 +++----
migration/ram.c | 33 +++++++--------------
migration/savevm.c | 42 +++++++++++++--------------
hw/vfio/trace-events | 3 +-
10 files changed, 84 insertions(+), 134 deletions(-)
diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
index 234d280249..e6a6ca3681 100644
--- a/docs/devel/migration/main.rst
+++ b/docs/devel/migration/main.rst
@@ -515,13 +515,8 @@ An iterative device must provide:
- A ``load_setup`` function that initialises the data structures on the
destination.
- - A ``state_pending_exact`` function that indicates how much more
- data we must save. The core migration code will use this to
- determine when to pause the CPUs and complete the migration.
-
- - A ``state_pending_estimate`` function that indicates how much more
- data we must save. When the estimated amount is smaller than the
- threshold, we call ``state_pending_exact``.
+ - A ``save_query_pending`` function that indicates how much more
+ data we must save.
- A ``save_live_iterate`` function should send a chunk of data until
the point that stream bandwidth limits tell it to stop. Each call
diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index 0790e5031d..33768c877c 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``load_setup`` function that sets the VFIO device on the destination in
_RESUMING state.
-* A ``state_pending_estimate`` function that reports an estimate of the
- remaining pre-copy data that the vendor driver has yet to save for the VFIO
- device.
-
-* A ``state_pending_exact`` function that reads pending_bytes from the vendor
- driver, which indicates the amount of data that the vendor driver has yet to
- save for the VFIO device.
+* A ``save_query_pending`` function that reports the remaining pre-copy
+ data that the vendor driver has yet to save for the VFIO device.
* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
active only when the VFIO device is in pre-copy states.
diff --git a/include/migration/register.h b/include/migration/register.h
index d0f37f5f43..aba3c9af2f 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -16,6 +16,13 @@
#include "hw/core/vmstate-if.h"
+typedef struct MigPendingData {
+ /* Pending bytes that must be transferred during precopy or stop-copy */
+ uint64_t precopy_bytes;
+ /* Pending bytes that can be transferred during postcopy */
+ uint64_t postcopy_bytes;
+} MigPendingData;
+
/**
* struct SaveVMHandlers: handler structure to finely control
* migration of complex subsystems and devices, such as RAM, block and
@@ -197,46 +204,19 @@ typedef struct SaveVMHandlers {
bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
/**
- * @state_pending_estimate
- *
- * This estimates the remaining data to transfer
- *
- * Sum of @can_postcopy and @must_postcopy is the whole amount of
- * pending data.
- *
- * @opaque: data pointer passed to register_savevm_live()
- * @must_precopy: amount of data that must be migrated in precopy
- * or in stopped state, i.e. that must be migrated
- * before target start.
- * @can_postcopy: amount of data that can be migrated in postcopy
- * or in stopped state, i.e. after target start.
- * Some can also be migrated during precopy (RAM).
- * Some must be migrated after source stops
- * (block-dirty-bitmap)
- */
- void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy);
-
- /**
- * @state_pending_exact
- *
- * This calculates the exact remaining data to transfer
+ * @save_query_pending
*
- * Sum of @can_postcopy and @must_postcopy is the whole amount of
- * pending data.
+ * This queries the remaining data to transfer on the source side.
+ * It is highly suggested that the module implement both a fast and a
+ * slow version of the query when it can be slow (see the @exact
+ * parameter below).
*
* @opaque: data pointer passed to register_savevm_live()
- * @must_precopy: amount of data that must be migrated in precopy
- * or in stopped state, i.e. that must be migrated
- * before target start.
- * @can_postcopy: amount of data that can be migrated in postcopy
- * or in stopped state, i.e. after target start.
- * Some can also be migrated during precopy (RAM).
- * Some must be migrated after source stops
- * (block-dirty-bitmap)
+ * @pending: pointer to a MigPendingData struct
+ * @exact: set true for an accurate (slow) query
*/
- void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy);
+ void (*save_query_pending)(void *opaque, MigPendingData *pending,
+ bool exact);
/**
* @load_state
diff --git a/migration/savevm.h b/migration/savevm.h
index b3d1e8a13c..e4efd243f3 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -14,6 +14,8 @@
#ifndef MIGRATION_SAVEVM_H
#define MIGRATION_SAVEVM_H
+#include "migration/register.h"
+
#define QEMU_VM_FILE_MAGIC 0x5145564d
#define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
#define QEMU_VM_FILE_VERSION 0x00000003
@@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
void qemu_savevm_state_cleanup(void);
void qemu_savevm_state_complete_postcopy(QEMUFile *f);
int qemu_savevm_state_complete_precopy(MigrationState *s);
+void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
uint64_t *can_postcopy);
void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index d808ece3b9..a22469a9e9 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -187,15 +187,15 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
return 0;
}
-static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void cmma_state_pending(void *opaque, MigPendingData *pending,
+ bool exact)
{
S390StAttribState *sas = S390_STATTRIB(opaque);
S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
long long res = sac->get_dirtycount(sas);
if (res >= 0) {
- *must_precopy += res;
+ pending->precopy_bytes += res;
}
}
@@ -340,8 +340,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
.save_setup = cmma_save_setup,
.save_live_iterate = cmma_save_iterate,
.save_complete = cmma_save_complete,
- .state_pending_exact = cmma_state_pending,
- .state_pending_estimate = cmma_state_pending,
+ .save_query_pending = cmma_state_pending,
.save_cleanup = cmma_save_cleanup,
.load_state = cmma_load,
.is_active = cmma_active,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 5d5fca09bd..1e999f0040 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -571,42 +571,39 @@ static void vfio_save_cleanup(void *opaque)
trace_vfio_save_cleanup(vbasedev->name);
}
-static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void vfio_state_pending_sync(VFIODevice *vbasedev)
{
- VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- if (!vfio_device_state_is_precopy(vbasedev)) {
- return;
- }
-
- *must_precopy +=
- migration->precopy_init_size + migration->precopy_dirty_size;
+ vfio_query_stop_copy_size(vbasedev);
- trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
- *can_postcopy,
- migration->precopy_init_size,
- migration->precopy_dirty_size);
+ if (vfio_device_state_is_precopy(vbasedev)) {
+ vfio_query_precopy_size(migration);
+ }
}
-static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void vfio_state_pending(void *opaque, MigPendingData *pending,
+ bool exact)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
+ uint64_t remain;
- vfio_query_stop_copy_size(vbasedev);
- *must_precopy += migration->stopcopy_size;
-
- if (vfio_device_state_is_precopy(vbasedev)) {
- vfio_query_precopy_size(migration);
+ if (exact) {
+ vfio_state_pending_sync(vbasedev);
+ remain = migration->stopcopy_size;
+ } else {
+ if (!vfio_device_state_is_precopy(vbasedev)) {
+ return;
+ }
+ remain = migration->precopy_init_size + migration->precopy_dirty_size;
}
- trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
- migration->stopcopy_size,
- migration->precopy_init_size,
- migration->precopy_dirty_size);
+ pending->precopy_bytes += remain;
+
+ trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
+ migration->precopy_init_size,
+ migration->precopy_dirty_size);
}
static bool vfio_is_active_iterate(void *opaque)
@@ -851,8 +848,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.save_prepare = vfio_save_prepare,
.save_setup = vfio_save_setup,
.save_cleanup = vfio_save_cleanup,
- .state_pending_estimate = vfio_state_pending_estimate,
- .state_pending_exact = vfio_state_pending_exact,
+ .save_query_pending = vfio_state_pending,
.is_active_iterate = vfio_is_active_iterate,
.save_live_iterate = vfio_save_iterate,
.save_complete = vfio_save_complete_precopy,
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index a061aad817..15d417013c 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -766,9 +766,8 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
return 0;
}
-static void dirty_bitmap_state_pending(void *opaque,
- uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
+ bool exact)
{
DBMSaveState *s = &((DBMState *)opaque)->save;
SaveBitmapState *dbms;
@@ -788,7 +787,7 @@ static void dirty_bitmap_state_pending(void *opaque,
trace_dirty_bitmap_state_pending(pending);
- *can_postcopy += pending;
+ data->postcopy_bytes += pending;
}
/* First occurrence of this bitmap. It should be created if doesn't exist */
@@ -1250,8 +1249,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
.save_setup = dirty_bitmap_save_setup,
.save_complete = dirty_bitmap_save_complete,
.has_postcopy = dirty_bitmap_has_postcopy,
- .state_pending_exact = dirty_bitmap_state_pending,
- .state_pending_estimate = dirty_bitmap_state_pending,
+ .save_query_pending = dirty_bitmap_state_pending,
.save_live_iterate = dirty_bitmap_save_iterate,
.is_active_iterate = dirty_bitmap_is_active_iterate,
.load_state = dirty_bitmap_load,
diff --git a/migration/ram.c b/migration/ram.c
index 979751f61b..e5b7217bf5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3443,30 +3443,18 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
return qemu_fflush(f);
}
-static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
-{
- RAMState **temp = opaque;
- RAMState *rs = *temp;
-
- uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
-
- if (migrate_postcopy_ram()) {
- /* We can do postcopy, and all the data is postcopiable */
- *can_postcopy += remaining_size;
- } else {
- *must_precopy += remaining_size;
- }
-}
-
-static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void ram_state_pending(void *opaque, MigPendingData *pending,
+ bool exact)
{
RAMState **temp = opaque;
RAMState *rs = *temp;
uint64_t remaining_size;
- if (!migration_in_postcopy()) {
+ /*
+ * Sync is not needed either with: (1) a fast query, or (2) after
+ * postcopy has started (no new dirty will generate anymore).
+ */
+ if (exact && !migration_in_postcopy()) {
bql_lock();
WITH_RCU_READ_LOCK_GUARD() {
migration_bitmap_sync_precopy(false);
@@ -3478,9 +3466,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
if (migrate_postcopy_ram()) {
/* We can do postcopy, and all the data is postcopiable */
- *can_postcopy += remaining_size;
+ pending->postcopy_bytes += remaining_size;
} else {
- *must_precopy += remaining_size;
+ pending->precopy_bytes += remaining_size;
}
}
@@ -4703,8 +4691,7 @@ static SaveVMHandlers savevm_ram_handlers = {
.save_live_iterate = ram_save_iterate,
.save_complete = ram_save_complete,
.has_postcopy = ram_has_postcopy,
- .state_pending_exact = ram_state_pending_exact,
- .state_pending_estimate = ram_state_pending_estimate,
+ .save_query_pending = ram_state_pending,
.load_state = ram_load,
.save_cleanup = ram_save_cleanup,
.load_setup = ram_load_setup,
diff --git a/migration/savevm.c b/migration/savevm.c
index dd58f2a705..392d840955 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1762,46 +1762,44 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
return qemu_fflush(f);
}
-/* Give an estimate of the amount left to be transferred,
- * the result is split into the amount for units that can and
- * for units that can't do postcopy.
- */
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
- uint64_t *can_postcopy)
+void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
{
SaveStateEntry *se;
- *must_precopy = 0;
- *can_postcopy = 0;
+ pending->precopy_bytes = 0;
+ pending->postcopy_bytes = 0;
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
- if (!se->ops || !se->ops->state_pending_estimate) {
+ if (!se->ops || !se->ops->save_query_pending) {
continue;
}
if (!qemu_savevm_state_active(se)) {
continue;
}
- se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
+ se->ops->save_query_pending(se->opaque, pending, exact);
}
}
+void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
+ uint64_t *can_postcopy)
+{
+ MigPendingData pending;
+
+ qemu_savevm_query_pending(&pending, false);
+
+ *must_precopy = pending.precopy_bytes;
+ *can_postcopy = pending.postcopy_bytes;
+}
+
void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
uint64_t *can_postcopy)
{
- SaveStateEntry *se;
+ MigPendingData pending;
- *must_precopy = 0;
- *can_postcopy = 0;
+ qemu_savevm_query_pending(&pending, true);
- QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
- if (!se->ops || !se->ops->state_pending_exact) {
- continue;
- }
- if (!qemu_savevm_state_active(se)) {
- continue;
- }
- se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
- }
+ *must_precopy = pending.precopy_bytes;
+ *can_postcopy = pending.postcopy_bytes;
}
void qemu_savevm_state_cleanup(void)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 846e3625c5..7cf5a9eb2d 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
vfio_save_iterate_start(const char *name) " (%s)"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
-vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
+vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
--
2.53.0
* Re: [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-08 16:55 ` [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
@ 2026-04-09 17:10 ` Juraj Marcin
2026-04-15 16:23 ` Peter Xu
2026-04-13 9:57 ` Avihai Horon
2026-04-16 14:18 ` Jason J. Herne
2 siblings, 1 reply; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:10 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
Hi Peter,
On 2026-04-08 12:55, Peter Xu wrote:
> These two APIs are largely duplicated. For example, there are a few users
> that directly pass in the same function for both hooks.
>
> Providing two hooks may also be error prone, because it makes it easier
> for one module to report different things via the two hooks.
>
> In reality, they should always report the same thing; the only question is
> whether we should use a fast path when the slow path might be too slow, as
> QEMU may query this information quite frequently during the migration
> process.
>
> Merge them into one API, providing a bool that shows whether the query is
> an exact query or not. No functional change intended.
>
> Export qemu_savevm_query_pending(). New users that need to do the query
> should use this newly provided API. This will happen very soon.
>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Jason Herne <jjherne@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Cc: John Snow <jsnow@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/devel/migration/main.rst | 9 ++----
> docs/devel/migration/vfio.rst | 9 ++----
> include/migration/register.h | 52 +++++++++++-----------------------
> migration/savevm.h | 3 ++
> hw/s390x/s390-stattrib.c | 9 +++---
> hw/vfio/migration.c | 48 ++++++++++++++-----------------
> migration/block-dirty-bitmap.c | 10 +++----
> migration/ram.c | 33 +++++++--------------
> migration/savevm.c | 42 +++++++++++++--------------
> hw/vfio/trace-events | 3 +-
> 10 files changed, 84 insertions(+), 134 deletions(-)
>
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 234d280249..e6a6ca3681 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -515,13 +515,8 @@ An iterative device must provide:
> - A ``load_setup`` function that initialises the data structures on the
> destination.
>
> - - A ``state_pending_exact`` function that indicates how much more
> - data we must save. The core migration code will use this to
> - determine when to pause the CPUs and complete the migration.
> -
> - - A ``state_pending_estimate`` function that indicates how much more
> - data we must save. When the estimated amount is smaller than the
> - threshold, we call ``state_pending_exact``.
> + - A ``save_query_pending`` function that indicates how much more
> + data we must save.
>
> - A ``save_live_iterate`` function should send a chunk of data until
> the point that stream bandwidth limits tell it to stop. Each call
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index 0790e5031d..33768c877c 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> * A ``load_setup`` function that sets the VFIO device on the destination in
> _RESUMING state.
>
> -* A ``state_pending_estimate`` function that reports an estimate of the
> - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> - device.
> -
> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> - driver, which indicates the amount of data that the vendor driver has yet to
> - save for the VFIO device.
> +* A ``save_query_pending`` function that reports the remaining pre-copy
> + data that the vendor driver has yet to save for the VFIO device.
>
> * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> active only when the VFIO device is in pre-copy states.
> diff --git a/include/migration/register.h b/include/migration/register.h
> index d0f37f5f43..aba3c9af2f 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -16,6 +16,13 @@
>
> #include "hw/core/vmstate-if.h"
>
> +typedef struct MigPendingData {
> + /* Amount of pending bytes can be transferred in precopy or stopcopy */
> + uint64_t precopy_bytes;
> + /* Amount of pending bytes can be transferred in postcopy */
> + uint64_t postcopy_bytes;
> +} MigPendingData;
> +
> /**
> * struct SaveVMHandlers: handler structure to finely control
> * migration of complex subsystems and devices, such as RAM, block and
> @@ -197,46 +204,19 @@ typedef struct SaveVMHandlers {
> bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>
> /**
> - * @state_pending_estimate
> - *
> - * This estimates the remaining data to transfer
> - *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> - *
> - * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> - */
> - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> -
> - /**
> - * @state_pending_exact
> - *
> - * This calculates the exact remaining data to transfer
> + * @save_query_pending
> *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> + * This estimates the remaining data to transfer on the source side.
> + * It's highly suggested that the module should implement both fastpath
> + * and slowpath version of it when it can be slow (for more information
> + * please check pending->fastpath field).
There is no pending->fastpath field anymore.
> *
> * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> + * @pending: pointer to a MigPendingData struct
> + * @exact: set true for an accurate (slow) query
> */
> - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> + void (*save_query_pending)(void *opaque, MigPendingData *pending,
> + bool exact);
>
> /**
> * @load_state
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b3d1e8a13c..e4efd243f3 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -14,6 +14,8 @@
> #ifndef MIGRATION_SAVEVM_H
> #define MIGRATION_SAVEVM_H
>
> +#include "migration/register.h"
> +
> #define QEMU_VM_FILE_MAGIC 0x5145564d
> #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> #define QEMU_VM_FILE_VERSION 0x00000003
> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> void qemu_savevm_state_cleanup(void);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> +void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy);
> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..a22469a9e9 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,15 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> return 0;
> }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending,
> + bool exact)
> {
> S390StAttribState *sas = S390_STATTRIB(opaque);
> S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> long long res = sac->get_dirtycount(sas);
>
> if (res >= 0) {
> - *must_precopy += res;
> + pending->precopy_bytes += res;
> }
> }
>
> @@ -340,8 +340,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> .save_setup = cmma_save_setup,
> .save_live_iterate = cmma_save_iterate,
> .save_complete = cmma_save_complete,
> - .state_pending_exact = cmma_state_pending,
> - .state_pending_estimate = cmma_state_pending,
> + .save_query_pending = cmma_state_pending,
> .save_cleanup = cmma_save_cleanup,
> .load_state = cmma_load,
> .is_active = cmma_active,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 5d5fca09bd..1e999f0040 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -571,42 +571,39 @@ static void vfio_save_cleanup(void *opaque)
> trace_vfio_save_cleanup(vbasedev->name);
> }
>
> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> {
> - VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
>
> - if (!vfio_device_state_is_precopy(vbasedev)) {
> - return;
> - }
> -
> - *must_precopy +=
> - migration->precopy_init_size + migration->precopy_dirty_size;
> + vfio_query_stop_copy_size(vbasedev);
>
> - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> - *can_postcopy,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + if (vfio_device_state_is_precopy(vbasedev)) {
> + vfio_query_precopy_size(migration);
> + }
> }
>
> -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending(void *opaque, MigPendingData *pending,
> + bool exact)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> + uint64_t remain;
>
> - vfio_query_stop_copy_size(vbasedev);
> - *must_precopy += migration->stopcopy_size;
> -
> - if (vfio_device_state_is_precopy(vbasedev)) {
> - vfio_query_precopy_size(migration);
> + if (exact) {
> + vfio_state_pending_sync(vbasedev);
> + remain = migration->stopcopy_size;
> + } else {
> + if (!vfio_device_state_is_precopy(vbasedev)) {
> + return;
> + }
> + remain = migration->precopy_init_size + migration->precopy_dirty_size;
> }
>
> - trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> - migration->stopcopy_size,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + pending->precopy_bytes += remain;
> +
> + trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> + migration->precopy_init_size,
> + migration->precopy_dirty_size);
> }
>
> static bool vfio_is_active_iterate(void *opaque)
> @@ -851,8 +848,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .save_prepare = vfio_save_prepare,
> .save_setup = vfio_save_setup,
> .save_cleanup = vfio_save_cleanup,
> - .state_pending_estimate = vfio_state_pending_estimate,
> - .state_pending_exact = vfio_state_pending_exact,
> + .save_query_pending = vfio_state_pending,
> .is_active_iterate = vfio_is_active_iterate,
> .save_live_iterate = vfio_save_iterate,
> .save_complete = vfio_save_complete_precopy,
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index a061aad817..15d417013c 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -766,9 +766,8 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> return 0;
> }
>
> -static void dirty_bitmap_state_pending(void *opaque,
> - uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
> + bool exact)
> {
> DBMSaveState *s = &((DBMState *)opaque)->save;
> SaveBitmapState *dbms;
> @@ -788,7 +787,7 @@ static void dirty_bitmap_state_pending(void *opaque,
>
> trace_dirty_bitmap_state_pending(pending);
>
> - *can_postcopy += pending;
> + data->postcopy_bytes += pending;
> }
>
> /* First occurrence of this bitmap. It should be created if doesn't exist */
> @@ -1250,8 +1249,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
> .save_setup = dirty_bitmap_save_setup,
> .save_complete = dirty_bitmap_save_complete,
> .has_postcopy = dirty_bitmap_has_postcopy,
> - .state_pending_exact = dirty_bitmap_state_pending,
> - .state_pending_estimate = dirty_bitmap_state_pending,
> + .save_query_pending = dirty_bitmap_state_pending,
> .save_live_iterate = dirty_bitmap_save_iterate,
> .is_active_iterate = dirty_bitmap_is_active_iterate,
> .load_state = dirty_bitmap_load,
> diff --git a/migration/ram.c b/migration/ram.c
> index 979751f61b..e5b7217bf5 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3443,30 +3443,18 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> return qemu_fflush(f);
> }
>
> -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - RAMState **temp = opaque;
> - RAMState *rs = *temp;
> -
> - uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -
> - if (migrate_postcopy_ram()) {
> - /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> - } else {
> - *must_precopy += remaining_size;
> - }
> -}
> -
> -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void ram_state_pending(void *opaque, MigPendingData *pending,
> + bool exact)
> {
> RAMState **temp = opaque;
> RAMState *rs = *temp;
> uint64_t remaining_size;
>
> - if (!migration_in_postcopy()) {
> + /*
> + * Sync is not needed either with: (1) a fast query, or (2) after
> + * postcopy has started (no new dirty will generate anymore).
> + */
> + if (exact && !migration_in_postcopy()) {
> bql_lock();
> WITH_RCU_READ_LOCK_GUARD() {
> migration_bitmap_sync_precopy(false);
> @@ -3478,9 +3466,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
>
> if (migrate_postcopy_ram()) {
> /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> + pending->postcopy_bytes += remaining_size;
> } else {
> - *must_precopy += remaining_size;
> + pending->precopy_bytes += remaining_size;
> }
> }
>
> @@ -4703,8 +4691,7 @@ static SaveVMHandlers savevm_ram_handlers = {
> .save_live_iterate = ram_save_iterate,
> .save_complete = ram_save_complete,
> .has_postcopy = ram_has_postcopy,
> - .state_pending_exact = ram_state_pending_exact,
> - .state_pending_estimate = ram_state_pending_estimate,
> + .save_query_pending = ram_state_pending,
> .load_state = ram_load,
> .save_cleanup = ram_save_cleanup,
> .load_setup = ram_load_setup,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index dd58f2a705..392d840955 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1762,46 +1762,44 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> return qemu_fflush(f);
> }
>
> -/* Give an estimate of the amount left to be transferred,
> - * the result is split into the amount for units that can and
> - * for units that can't do postcopy.
> - */
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> {
> SaveStateEntry *se;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + pending->precopy_bytes = 0;
> + pending->postcopy_bytes = 0;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_estimate) {
> + if (!se->ops || !se->ops->save_query_pending) {
> continue;
> }
> if (!qemu_savevm_state_active(se)) {
> continue;
> }
> - se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> + se->ops->save_query_pending(se->opaque, pending, exact);
> }
> }
>
> +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> + uint64_t *can_postcopy)
> +{
> + MigPendingData pending;
> +
> + qemu_savevm_query_pending(&pending, false);
> +
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> +}
> +
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> - SaveStateEntry *se;
> + MigPendingData pending;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + qemu_savevm_query_pending(&pending, true);
>
> - QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_exact) {
> - continue;
> - }
> - if (!qemu_savevm_state_active(se)) {
> - continue;
> - }
> - se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> - }
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> }
>
> void qemu_savevm_state_cleanup(void)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 846e3625c5..7cf5a9eb2d 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_save_iterate_start(const char *name) " (%s)"
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
> -vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> +vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>
> --
> 2.53.0
>
Apart from that one comment above, it looks good to me!
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
* Re: [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-09 17:10 ` Juraj Marcin
@ 2026-04-15 16:23 ` Peter Xu
2026-04-16 8:24 ` Juraj Marcin
0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-15 16:23 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
On Thu, Apr 09, 2026 at 07:10:56PM +0200, Juraj Marcin wrote:
> > - /**
> > - * @state_pending_exact
> > - *
> > - * This calculates the exact remaining data to transfer
> > + * @save_query_pending
> > *
> > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > - * pending data.
> > + * This estimates the remaining data to transfer on the source side.
> > + * It's highly suggested that the module should implement both fastpath
> > + * and slowpath version of it when it can be slow (for more information
> > + * please check pending->fastpath field).
>
> There is no pending->fastpath field anymore.
Yep, I missed it. While at it, I rewrote this part to make it read better;
latest version now:
/**
* @save_query_pending
*
* This estimates the remaining data to transfer on the source side.
*
* When @exact is true, a module must report accurate results. When
* @exact is false, a module may report estimates.
*
* It's highly recommended that modules implement a faster version of
* the query path (for example, by properly caching the counters) if
* an accurate query will be time-consuming.
*
* @opaque: data pointer passed to register_savevm_live()
* @pending: pointer to a MigPendingData struct
* @exact: set to true for an accurate (slow) query
*/
[...]
> Apart from that one comment above, it looks good to me!
>
> Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Thanks, Juraj!
--
Peter Xu
* Re: [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-15 16:23 ` Peter Xu
@ 2026-04-16 8:24 ` Juraj Marcin
0 siblings, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-16 8:24 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
On 2026-04-15 12:23, Peter Xu wrote:
> On Thu, Apr 09, 2026 at 07:10:56PM +0200, Juraj Marcin wrote:
> > > - /**
> > > - * @state_pending_exact
> > > - *
> > > - * This calculates the exact remaining data to transfer
> > > + * @save_query_pending
> > > *
> > > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > > - * pending data.
> > > + * This estimates the remaining data to transfer on the source side.
> > > + * It's highly suggested that the module should implement both fastpath
> > > + * and slowpath version of it when it can be slow (for more information
> > > + * please check pending->fastpath field).
> >
> > There is no pending->fastpath field anymore.
>
> Yep I missed it.. When at it, I rewrote this part to make it read better,
> latest version now:
>
> /**
> * @save_query_pending
> *
> * This estimates the remaining data to transfer on the source side.
> *
> * When @exact is true, a module must report accurate results. When
> * @exact is false, a module may report estimates.
> *
> * It's highly recommended that modules implement a faster version of
> * the query path (for example, by proper caching on the counters) if
> * an accurate query will be time-consuming.
> *
> * @opaque: data pointer passed to register_savevm_live()
> * @pending: pointer to a MigPendingData struct
> * @exact: set to true for an accurate (slow) query
> */
>
Looks good, thanks!
>
> [...]
>
> > Apart from that one comment above, it looks good to me!
> >
> > Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
>
> Thanks, Juraj!
>
> --
> Peter Xu
>
* Re: [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-08 16:55 ` [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
2026-04-09 17:10 ` Juraj Marcin
@ 2026-04-13 9:57 ` Avihai Horon
2026-04-16 14:01 ` Peter Xu
2026-04-16 14:18 ` Jason J. Herne
2 siblings, 1 reply; 52+ messages in thread
From: Avihai Horon @ 2026-04-13 9:57 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
On 4/8/2026 7:55 PM, Peter Xu wrote:
>
> These two APIs are largely duplicated. For example, there are a few users
> that directly pass in the same function for both hooks.
>
> Providing two hooks may also be error prone, because it makes it easier
> for one module to report different things via the two hooks.
>
> In reality, they should always report the same thing; the only question is
> whether we should use a fast path when the slow path might be too slow, as
> QEMU may query this information quite frequently during the migration
> process.
>
> Merge it into one API, provide a bool showing if the query is an exact
> query or not. No functional change intended.
>
> Export qemu_savevm_query_pending(). We should use the new API here
> provided when there're new users to do the query. This will happen very
> soon.
>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Jason Herne <jjherne@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Cc: John Snow <jsnow@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/devel/migration/main.rst | 9 ++----
> docs/devel/migration/vfio.rst | 9 ++----
> include/migration/register.h | 52 +++++++++++-----------------------
> migration/savevm.h | 3 ++
> hw/s390x/s390-stattrib.c | 9 +++---
> hw/vfio/migration.c | 48 ++++++++++++++-----------------
> migration/block-dirty-bitmap.c | 10 +++----
> migration/ram.c | 33 +++++++--------------
> migration/savevm.c | 42 +++++++++++++--------------
> hw/vfio/trace-events | 3 +-
> 10 files changed, 84 insertions(+), 134 deletions(-)
>
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 234d280249..e6a6ca3681 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -515,13 +515,8 @@ An iterative device must provide:
> - A ``load_setup`` function that initialises the data structures on the
> destination.
>
> - - A ``state_pending_exact`` function that indicates how much more
> - data we must save. The core migration code will use this to
> - determine when to pause the CPUs and complete the migration.
> -
> - - A ``state_pending_estimate`` function that indicates how much more
> - data we must save. When the estimated amount is smaller than the
> - threshold, we call ``state_pending_exact``.
> + - A ``save_query_pending`` function that indicates how much more
> + data we must save.
>
> - A ``save_live_iterate`` function should send a chunk of data until
> the point that stream bandwidth limits tell it to stop. Each call
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index 0790e5031d..33768c877c 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> * A ``load_setup`` function that sets the VFIO device on the destination in
> _RESUMING state.
>
> -* A ``state_pending_estimate`` function that reports an estimate of the
> - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> - device.
> -
> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> - driver, which indicates the amount of data that the vendor driver has yet to
> - save for the VFIO device.
> +* A ``save_query_pending`` function that reports the remaining pre-copy
> + data that the vendor driver has yet to save for the VFIO device.
Maybe drop "pre-copy" since we also report stopcopy bytes?
>
> * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> active only when the VFIO device is in pre-copy states.
> diff --git a/include/migration/register.h b/include/migration/register.h
> index d0f37f5f43..aba3c9af2f 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -16,6 +16,13 @@
>
> #include "hw/core/vmstate-if.h"
>
> +typedef struct MigPendingData {
> + /* Amount of pending bytes can be transferred in precopy or stopcopy */
> + uint64_t precopy_bytes;
> + /* Amount of pending bytes can be transferred in postcopy */
> + uint64_t postcopy_bytes;
> +} MigPendingData;
> +
> /**
> * struct SaveVMHandlers: handler structure to finely control
> * migration of complex subsystems and devices, such as RAM, block and
> @@ -197,46 +204,19 @@ typedef struct SaveVMHandlers {
> bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>
> /**
> - * @state_pending_estimate
> - *
> - * This estimates the remaining data to transfer
> - *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> - *
> - * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> - */
> - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> -
> - /**
> - * @state_pending_exact
> - *
> - * This calculates the exact remaining data to transfer
> + * @save_query_pending
> *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> + * This estimates the remaining data to transfer on the source side.
> + * It's highly suggested that the module should implement both fastpath
> + * and slowpath version of it when it can be slow (for more information
> + * please check pending->fastpath field).
> *
> * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> + * @pending: pointer to a MigPendingData struct
> + * @exact: set true for an accurate (slow) query
> */
> - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> + void (*save_query_pending)(void *opaque, MigPendingData *pending,
> + bool exact);
>
> /**
> * @load_state
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b3d1e8a13c..e4efd243f3 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -14,6 +14,8 @@
> #ifndef MIGRATION_SAVEVM_H
> #define MIGRATION_SAVEVM_H
>
> +#include "migration/register.h"
> +
> #define QEMU_VM_FILE_MAGIC 0x5145564d
> #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> #define QEMU_VM_FILE_VERSION 0x00000003
> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> void qemu_savevm_state_cleanup(void);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> +void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy);
> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..a22469a9e9 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,15 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> return 0;
> }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending,
> + bool exact)
> {
> S390StAttribState *sas = S390_STATTRIB(opaque);
> S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> long long res = sac->get_dirtycount(sas);
>
> if (res >= 0) {
> - *must_precopy += res;
> + pending->precopy_bytes += res;
> }
> }
>
> @@ -340,8 +340,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> .save_setup = cmma_save_setup,
> .save_live_iterate = cmma_save_iterate,
> .save_complete = cmma_save_complete,
> - .state_pending_exact = cmma_state_pending,
> - .state_pending_estimate = cmma_state_pending,
> + .save_query_pending = cmma_state_pending,
> .save_cleanup = cmma_save_cleanup,
> .load_state = cmma_load,
> .is_active = cmma_active,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 5d5fca09bd..1e999f0040 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -571,42 +571,39 @@ static void vfio_save_cleanup(void *opaque)
> trace_vfio_save_cleanup(vbasedev->name);
> }
>
> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> {
> - VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
>
> - if (!vfio_device_state_is_precopy(vbasedev)) {
> - return;
> - }
> -
> - *must_precopy +=
> - migration->precopy_init_size + migration->precopy_dirty_size;
> + vfio_query_stop_copy_size(vbasedev);
>
> - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> - *can_postcopy,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + if (vfio_device_state_is_precopy(vbasedev)) {
> + vfio_query_precopy_size(migration);
> + }
> }
>
> -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending(void *opaque, MigPendingData *pending,
> + bool exact)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> + uint64_t remain;
>
> - vfio_query_stop_copy_size(vbasedev);
> - *must_precopy += migration->stopcopy_size;
> -
> - if (vfio_device_state_is_precopy(vbasedev)) {
> - vfio_query_precopy_size(migration);
> + if (exact) {
> + vfio_state_pending_sync(vbasedev);
> + remain = migration->stopcopy_size;
> + } else {
> + if (!vfio_device_state_is_precopy(vbasedev)) {
> + return;
> + }
> + remain = migration->precopy_init_size + migration->precopy_dirty_size;
> }
>
> - trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> - migration->stopcopy_size,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + pending->precopy_bytes += remain;
> +
> + trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> + migration->precopy_init_size,
> + migration->precopy_dirty_size);
Let's also log "exact", I think it can be useful.
With this and Juraj's comment fixed:
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
> }
>
> static bool vfio_is_active_iterate(void *opaque)
> @@ -851,8 +848,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .save_prepare = vfio_save_prepare,
> .save_setup = vfio_save_setup,
> .save_cleanup = vfio_save_cleanup,
> - .state_pending_estimate = vfio_state_pending_estimate,
> - .state_pending_exact = vfio_state_pending_exact,
> + .save_query_pending = vfio_state_pending,
> .is_active_iterate = vfio_is_active_iterate,
> .save_live_iterate = vfio_save_iterate,
> .save_complete = vfio_save_complete_precopy,
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index a061aad817..15d417013c 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -766,9 +766,8 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> return 0;
> }
>
> -static void dirty_bitmap_state_pending(void *opaque,
> - uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data,
> + bool exact)
> {
> DBMSaveState *s = &((DBMState *)opaque)->save;
> SaveBitmapState *dbms;
> @@ -788,7 +787,7 @@ static void dirty_bitmap_state_pending(void *opaque,
>
> trace_dirty_bitmap_state_pending(pending);
>
> - *can_postcopy += pending;
> + data->postcopy_bytes += pending;
> }
>
> /* First occurrence of this bitmap. It should be created if doesn't exist */
> @@ -1250,8 +1249,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
> .save_setup = dirty_bitmap_save_setup,
> .save_complete = dirty_bitmap_save_complete,
> .has_postcopy = dirty_bitmap_has_postcopy,
> - .state_pending_exact = dirty_bitmap_state_pending,
> - .state_pending_estimate = dirty_bitmap_state_pending,
> + .save_query_pending = dirty_bitmap_state_pending,
> .save_live_iterate = dirty_bitmap_save_iterate,
> .is_active_iterate = dirty_bitmap_is_active_iterate,
> .load_state = dirty_bitmap_load,
> diff --git a/migration/ram.c b/migration/ram.c
> index 979751f61b..e5b7217bf5 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3443,30 +3443,18 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> return qemu_fflush(f);
> }
>
> -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - RAMState **temp = opaque;
> - RAMState *rs = *temp;
> -
> - uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -
> - if (migrate_postcopy_ram()) {
> - /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> - } else {
> - *must_precopy += remaining_size;
> - }
> -}
> -
> -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void ram_state_pending(void *opaque, MigPendingData *pending,
> + bool exact)
> {
> RAMState **temp = opaque;
> RAMState *rs = *temp;
> uint64_t remaining_size;
>
> - if (!migration_in_postcopy()) {
> + /*
> + * Sync is not needed either with: (1) a fast query, or (2) after
> + * postcopy has started (no new dirty will generate anymore).
> + */
> + if (exact && !migration_in_postcopy()) {
> bql_lock();
> WITH_RCU_READ_LOCK_GUARD() {
> migration_bitmap_sync_precopy(false);
> @@ -3478,9 +3466,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
>
> if (migrate_postcopy_ram()) {
> /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> + pending->postcopy_bytes += remaining_size;
> } else {
> - *must_precopy += remaining_size;
> + pending->precopy_bytes += remaining_size;
> }
> }
>
> @@ -4703,8 +4691,7 @@ static SaveVMHandlers savevm_ram_handlers = {
> .save_live_iterate = ram_save_iterate,
> .save_complete = ram_save_complete,
> .has_postcopy = ram_has_postcopy,
> - .state_pending_exact = ram_state_pending_exact,
> - .state_pending_estimate = ram_state_pending_estimate,
> + .save_query_pending = ram_state_pending,
> .load_state = ram_load,
> .save_cleanup = ram_save_cleanup,
> .load_setup = ram_load_setup,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index dd58f2a705..392d840955 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1762,46 +1762,44 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> return qemu_fflush(f);
> }
>
> -/* Give an estimate of the amount left to be transferred,
> - * the result is split into the amount for units that can and
> - * for units that can't do postcopy.
> - */
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> {
> SaveStateEntry *se;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + pending->precopy_bytes = 0;
> + pending->postcopy_bytes = 0;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_estimate) {
> + if (!se->ops || !se->ops->save_query_pending) {
> continue;
> }
> if (!qemu_savevm_state_active(se)) {
> continue;
> }
> - se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> + se->ops->save_query_pending(se->opaque, pending, exact);
> }
> }
>
> +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> + uint64_t *can_postcopy)
> +{
> + MigPendingData pending;
> +
> + qemu_savevm_query_pending(&pending, false);
> +
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> +}
> +
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> - SaveStateEntry *se;
> + MigPendingData pending;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + qemu_savevm_query_pending(&pending, true);
>
> - QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_exact) {
> - continue;
> - }
> - if (!qemu_savevm_state_active(se)) {
> - continue;
> - }
> - se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> - }
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> }
>
> void qemu_savevm_state_cleanup(void)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 846e3625c5..7cf5a9eb2d 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_save_iterate_start(const char *name) " (%s)"
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
> -vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> +vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>
> --
> 2.53.0
>
* Re: [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-13 9:57 ` Avihai Horon
@ 2026-04-16 14:01 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-16 14:01 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
On Mon, Apr 13, 2026 at 12:57:17PM +0300, Avihai Horon wrote:
> > @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> > * A ``load_setup`` function that sets the VFIO device on the destination in
> > _RESUMING state.
> >
> > -* A ``state_pending_estimate`` function that reports an estimate of the
> > - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> > - device.
> > -
> > -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> > - driver, which indicates the amount of data that the vendor driver has yet to
> > - save for the VFIO device.
> > +* A ``save_query_pending`` function that reports the remaining pre-copy
> > + data that the vendor driver has yet to save for the VFIO device.
>
> Maybe drop "pre-copy" since we also report stopcopy bytes?
Sure.
[...]
> > + pending->precopy_bytes += remain;
> > +
> > + trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> > + migration->precopy_init_size,
> > + migration->precopy_dirty_size);
>
> Let's also log "exact", I think it can be useful.
Sure.
>
> With this and Juraj's comment fixed:
> Reviewed-by: Avihai Horon <avihaih@nvidia.com>
Thanks,
--
Peter Xu
* Re: [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-08 16:55 ` [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
2026-04-09 17:10 ` Juraj Marcin
2026-04-13 9:57 ` Avihai Horon
@ 2026-04-16 14:18 ` Jason J. Herne
2 siblings, 0 replies; 52+ messages in thread
From: Jason J. Herne @ 2026-04-16 14:18 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, Halil Pasic,
Christian Borntraeger, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
> ...
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..a22469a9e9 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,15 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> return 0;
> }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending,
> + bool exact)
> {
> S390StAttribState *sas = S390_STATTRIB(opaque);
> S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> long long res = sac->get_dirtycount(sas);
>
> if (res >= 0) {
> - *must_precopy += res;
> + pending->precopy_bytes += res;
> }
> }
>
> @@ -340,8 +340,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> .save_setup = cmma_save_setup,
> .save_live_iterate = cmma_save_iterate,
> .save_complete = cmma_save_complete,
> - .state_pending_exact = cmma_state_pending,
> - .state_pending_estimate = cmma_state_pending,
> + .save_query_pending = cmma_state_pending,
> .save_cleanup = cmma_save_cleanup,
> .load_state = cmma_load,
> .is_active = cmma_active,
LGTM,
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
* [PATCH 05/14] migration: Use the new save_query_pending() API directly
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (3 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 04/14] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-13 9:59 ` Avihai Horon
2026-04-08 16:55 ` [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
` (10 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
It's easier to use the new API directly in the migration iterations. This
also paves the way for follow-up patches to report new data directly to
the iterator function.
While at it, merge the two tracepoints into one.
No functional change intended.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.h | 4 ----
migration/migration.c | 16 +++++++---------
migration/savevm.c | 23 ++---------------------
migration/trace-events | 3 +--
4 files changed, 10 insertions(+), 36 deletions(-)
diff --git a/migration/savevm.h b/migration/savevm.h
index e4efd243f3..96fdf96d4e 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -46,10 +46,6 @@ void qemu_savevm_state_cleanup(void);
void qemu_savevm_state_complete_postcopy(QEMUFile *f);
int qemu_savevm_state_complete_precopy(MigrationState *s);
void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
- uint64_t *can_postcopy);
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
- uint64_t *can_postcopy);
int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
void qemu_savevm_state_end(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index dfc60372cf..68cfe2d3bf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3204,17 +3204,17 @@ typedef enum {
*/
static MigIterateState migration_iteration_run(MigrationState *s)
{
- uint64_t must_precopy, can_postcopy, pending_size;
Error *local_err = NULL;
bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
bool can_switchover = migration_can_switchover(s);
+ MigPendingData pending = { };
+ uint64_t pending_size;
bool complete_ready;
/* Fast path - get the estimated amount of pending data */
- qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
- pending_size = must_precopy + can_postcopy;
- trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
+ qemu_savevm_query_pending(&pending, false);
+ pending_size = pending.precopy_bytes + pending.postcopy_bytes;
if (in_postcopy) {
/*
@@ -3243,14 +3243,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* during postcopy phase.
*/
if (pending_size <= s->threshold_size) {
- qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
- pending_size = must_precopy + can_postcopy;
- trace_migrate_pending_exact(pending_size, must_precopy,
- can_postcopy);
+ qemu_savevm_query_pending(&pending, true);
+ pending_size = pending.precopy_bytes + pending.postcopy_bytes;
}
/* Should we switch to postcopy now? */
- if (must_precopy <= s->threshold_size &&
+ if (pending.precopy_bytes <= s->threshold_size &&
can_switchover && qatomic_read(&s->start_postcopy)) {
if (postcopy_start(s, &local_err)) {
migrate_error_propagate(s, error_copy(local_err));
diff --git a/migration/savevm.c b/migration/savevm.c
index 392d840955..397f602257 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1778,28 +1778,9 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
}
se->ops->save_query_pending(se->opaque, pending, exact);
}
-}
-
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
- uint64_t *can_postcopy)
-{
- MigPendingData pending;
-
- qemu_savevm_query_pending(&pending, false);
-
- *must_precopy = pending.precopy_bytes;
- *can_postcopy = pending.postcopy_bytes;
-}
-
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
- uint64_t *can_postcopy)
-{
- MigPendingData pending;
-
- qemu_savevm_query_pending(&pending, true);
- *must_precopy = pending.precopy_bytes;
- *can_postcopy = pending.postcopy_bytes;
+ trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
+ pending->postcopy_bytes);
}
void qemu_savevm_state_cleanup(void)
diff --git a/migration/trace-events b/migration/trace-events
index 60e5087e38..f8995b8d0d 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
+qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", postcopy=%"PRIu64
loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
@@ -159,8 +160,6 @@ migration_cleanup(void) ""
migrate_error(const char *error_desc) "error=%s"
migration_cancel(void) ""
migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at 0x%zx len 0x%zx"
-migrate_pending_exact(uint64_t size, uint64_t pre, uint64_t post) "exact pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
-migrate_pending_estimate(uint64_t size, uint64_t pre, uint64_t post) "estimate pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
migrate_send_rp_message(int msg_type, uint16_t len) "%d: len %d"
migrate_send_rp_recv_bitmap(char *name, int64_t size) "block '%s' size 0x%"PRIi64
migration_completion_file_err(void) ""
--
2.53.0
* Re: [PATCH 05/14] migration: Use the new save_query_pending() API directly
2026-04-08 16:55 ` [PATCH 05/14] migration: Use the new save_query_pending() API directly Peter Xu
@ 2026-04-13 9:59 ` Avihai Horon
0 siblings, 0 replies; 52+ messages in thread
From: Avihai Horon @ 2026-04-13 9:59 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 4/8/2026 7:55 PM, Peter Xu wrote:
> It's easier to use the new API directly in the migration iterations. This
> also paves the way for follow-up patches to report new data directly to
> the iterator function.
>
> While at it, merge the two tracepoints into one.
>
> No functional change intended.
>
> Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/savevm.h | 4 ----
> migration/migration.c | 16 +++++++---------
> migration/savevm.c | 23 ++---------------------
> migration/trace-events | 3 +--
> 4 files changed, 10 insertions(+), 36 deletions(-)
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
* [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (4 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 05/14] migration: Use the new save_query_pending() API directly Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:13 ` Juraj Marcin
` (2 more replies)
2026-04-08 16:55 ` [PATCH 07/14] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
` (9 subsequent siblings)
15 siblings, 3 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
Allow modules to report data that can only be migrated after the VM is
stopped. With this concept introduced, we need to account the stopcopy
size as part of pending_size, as before.
However, when there is data that can only be migrated in the stopcopy
phase, the old "pending_size" may not always drop low enough to kick off
the slow version of the query sync.
That used to be almost guaranteed to happen, as none of the prior
iterative modules had stopcopy-only data. VFIO may change that fact by
having some data that must be copied during the stop phase.
So we need to make sure QEMU kicks off a synchronized version of the
pending query once all precopy data has been migrated. This can be
important for VFIO to keep making progress even if the downtime
requirement cannot yet be satisfied.
So far, this patch should introduce no functional change, as no module
reports a stopcopy size yet.
This paves the way for VFIO to properly report its pending data sizes,
which will start to include stop-only data.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/migration/register.h | 7 +++++
migration/migration.c | 52 ++++++++++++++++++++++++++++++------
migration/savevm.c | 7 +++--
migration/trace-events | 2 +-
4 files changed, 57 insertions(+), 11 deletions(-)
diff --git a/include/migration/register.h b/include/migration/register.h
index aba3c9af2f..e822a2a59f 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -21,6 +21,13 @@ typedef struct MigPendingData {
uint64_t precopy_bytes;
/* Amount of pending bytes can be transferred in postcopy */
uint64_t postcopy_bytes;
+ /* Amount of pending bytes can be transferred only in stopcopy */
+ uint64_t stopcopy_bytes;
+ /*
+ * Total pending data, modules do not need to update this field, it
+ * will be automatically calculated by migration core API.
+ */
+ uint64_t total_bytes;
} MigPendingData;
/**
diff --git a/migration/migration.c b/migration/migration.c
index 68cfe2d3bf..bb17bd0e68 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3198,6 +3198,44 @@ typedef enum {
MIG_ITERATE_BREAK, /* Break the loop */
} MigIterateState;
+/* Are we ready to move to the next iteration phase? */
+static bool migration_iteration_next_ready(MigrationState *s,
+ MigPendingData *pending)
+{
+ /*
+ * If the estimated values already suggest us to switchover, mark this
+ * iteration finished, time to do a slow sync.
+ */
+ if (pending->total_bytes <= s->threshold_size) {
+ return true;
+ }
+
+ /*
+ * Since we may have modules reporting stop-only data, we also want to
+ * re-query with slow mode if all precopy data is moved over. This
+ * will also mark the current iteration done.
+ *
+ * This could happen when e.g. a module (like, VFIO) reports stopcopy
+ * size too large so it will never yet satisfy the downtime with the
+ * current setup (above check). Here, slow version of re-query helps
+ * because we keep trying the best to move whatever we have.
+ */
+ if (pending->precopy_bytes == 0) {
+ return true;
+ }
+
+ return false;
+}
+
+static void migration_iteration_go_next(MigPendingData *pending)
+{
+ /*
+ * Doing a slow sync will achieve this. TODO: move RAM iteration code
+ * into the core layer.
+ */
+ qemu_savevm_query_pending(pending, true);
+}
+
/*
* Return true if continue to the next iteration directly, false
* otherwise.
@@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
bool can_switchover = migration_can_switchover(s);
MigPendingData pending = { };
- uint64_t pending_size;
bool complete_ready;
/* Fast path - get the estimated amount of pending data */
qemu_savevm_query_pending(&pending, false);
- pending_size = pending.precopy_bytes + pending.postcopy_bytes;
if (in_postcopy) {
/*
@@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* postcopy completion doesn't rely on can_switchover, because when
* POSTCOPY_ACTIVE it means switchover already happened.
*/
- complete_ready = !pending_size;
+ complete_ready = !pending.total_bytes;
if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
(s->postcopy_package_loaded || complete_ready)) {
/*
@@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* postcopy started, so ESTIMATE should always match with EXACT
* during postcopy phase.
*/
- if (pending_size <= s->threshold_size) {
- qemu_savevm_query_pending(&pending, true);
- pending_size = pending.precopy_bytes + pending.postcopy_bytes;
+ if (migration_iteration_next_ready(s, &pending)) {
+ migration_iteration_go_next(&pending);
}
/* Should we switch to postcopy now? */
@@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* (2) Pending size is no more than the threshold specified
* (which was calculated from expected downtime)
*/
- complete_ready = can_switchover && (pending_size <= s->threshold_size);
+ complete_ready = can_switchover &&
+ (pending.total_bytes <= s->threshold_size);
}
if (complete_ready) {
- trace_migration_thread_low_pending(pending_size);
+ trace_migration_thread_low_pending(pending.total_bytes);
migration_completion(s);
return MIG_ITERATE_BREAK;
}
diff --git a/migration/savevm.c b/migration/savevm.c
index 397f602257..b75c311a95 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
{
SaveStateEntry *se;
- pending->precopy_bytes = 0;
- pending->postcopy_bytes = 0;
+ memset(pending, 0, sizeof(*pending));
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops || !se->ops->save_query_pending) {
@@ -1779,7 +1778,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
se->ops->save_query_pending(se->opaque, pending, exact);
}
+ pending->total_bytes = pending->precopy_bytes +
+ pending->stopcopy_bytes + pending->postcopy_bytes;
+
trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
+ pending->stopcopy_bytes,
pending->postcopy_bytes);
}
diff --git a/migration/trace-events b/migration/trace-events
index f8995b8d0d..2f86ad448e 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
-qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", postcopy=%"PRIu64
+qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
--
2.53.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-08 16:55 ` [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
@ 2026-04-09 17:13 ` Juraj Marcin
2026-04-09 17:36 ` Juraj Marcin
2026-04-13 10:34 ` Avihai Horon
2 siblings, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:13 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 2026-04-08 12:55, Peter Xu wrote:
> Allow modules to report data that can only be migrated after VM is stopped.
>
> When this concept is introduced, we will need to account stopcopy size to
> be part of pending_size as before.
>
> However, when there're data only can be migrated in stopcopy phase, it
> means the old "pending_size" may not always be able to reach low enough to
> kickoff an slow version of query sync.
>
> It used to be almost guaranteed to happen as all prior iterative modules
> doesn't have stopcopy only data. VFIO may change that fact by having some
> data that must be copied during stop phase.
>
> So we need to make sure QEMU will kickoff a synchronized version of query
> pending when all precopy data is migrated. This might be important to VFIO
> to keep making progress even if the downtime cannot yet be satisfied.
>
> So far, this patch should introduce no functional change, as no module yet
> report stopcopy size.
>
> This paves way for VFIO to properly report its pending data sizes, which
> will start to include stop-only data.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/register.h | 7 +++++
> migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> migration/savevm.c | 7 +++--
> migration/trace-events | 2 +-
> 4 files changed, 57 insertions(+), 11 deletions(-)
>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-08 16:55 ` [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
2026-04-09 17:13 ` Juraj Marcin
@ 2026-04-09 17:36 ` Juraj Marcin
2026-04-16 17:20 ` Peter Xu
2026-04-13 10:34 ` Avihai Horon
2 siblings, 1 reply; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:36 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
Hi Peter,
actually, I do have one question, see inline
On 2026-04-08 12:55, Peter Xu wrote:
> Allow modules to report data that can only be migrated after VM is stopped.
>
> When this concept is introduced, we will need to account stopcopy size to
> be part of pending_size as before.
>
> However, when there're data only can be migrated in stopcopy phase, it
> means the old "pending_size" may not always be able to reach low enough to
> kickoff an slow version of query sync.
>
> It used to be almost guaranteed to happen as all prior iterative modules
> doesn't have stopcopy only data. VFIO may change that fact by having some
> data that must be copied during stop phase.
>
> So we need to make sure QEMU will kickoff a synchronized version of query
> pending when all precopy data is migrated. This might be important to VFIO
> to keep making progress even if the downtime cannot yet be satisfied.
>
> So far, this patch should introduce no functional change, as no module yet
> report stopcopy size.
>
> This paves way for VFIO to properly report its pending data sizes, which
> will start to include stop-only data.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/register.h | 7 +++++
> migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> migration/savevm.c | 7 +++--
> migration/trace-events | 2 +-
> 4 files changed, 57 insertions(+), 11 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index aba3c9af2f..e822a2a59f 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -21,6 +21,13 @@ typedef struct MigPendingData {
> uint64_t precopy_bytes;
> /* Amount of pending bytes can be transferred in postcopy */
> uint64_t postcopy_bytes;
> + /* Amount of pending bytes can be transferred only in stopcopy */
> + uint64_t stopcopy_bytes;
> + /*
> + * Total pending data, modules do not need to update this field, it
> + * will be automatically calculated by migration core API.
> + */
> + uint64_t total_bytes;
> } MigPendingData;
>
> /**
> diff --git a/migration/migration.c b/migration/migration.c
> index 68cfe2d3bf..bb17bd0e68 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3198,6 +3198,44 @@ typedef enum {
> MIG_ITERATE_BREAK, /* Break the loop */
> } MigIterateState;
>
> +/* Are we ready to move to the next iteration phase? */
> +static bool migration_iteration_next_ready(MigrationState *s,
> + MigPendingData *pending)
> +{
> + /*
> + * If the estimated values already suggest us to switchover, mark this
> + * iteration finished, time to do a slow sync.
> + */
> + if (pending->total_bytes <= s->threshold_size) {
> + return true;
> + }
> +
> + /*
> + * Since we may have modules reporting stop-only data, we also want to
> + * re-query with slow mode if all precopy data is moved over. This
> + * will also mark the current iteration done.
> + *
> + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> + * size too large so it will never yet satisfy the downtime with the
> + * current setup (above check). Here, slow version of re-query helps
> + * because we keep trying the best to move whatever we have.
> + */
> + if (pending->precopy_bytes == 0) {
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void migration_iteration_go_next(MigPendingData *pending)
> +{
> + /*
> + * Do a slow sync will achieve this. TODO: move RAM iteration code
> + * into the core layer.
> + */
> + qemu_savevm_query_pending(pending, true);
> +}
> +
> /*
> * Return true if continue to the next iteration directly, false
> * otherwise.
> @@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> bool can_switchover = migration_can_switchover(s);
> MigPendingData pending = { };
> - uint64_t pending_size;
> bool complete_ready;
>
> /* Fast path - get the estimated amount of pending data */
> qemu_savevm_query_pending(&pending, false);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
> if (in_postcopy) {
> /*
> @@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy completion doesn't rely on can_switchover, because when
> * POSTCOPY_ACTIVE it means switchover already happened.
> */
> - complete_ready = !pending_size;
> + complete_ready = !pending.total_bytes;
> if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
> (s->postcopy_package_loaded || complete_ready)) {
> /*
> @@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy started, so ESTIMATE should always match with EXACT
> * during postcopy phase.
> */
> - if (pending_size <= s->threshold_size) {
> - qemu_savevm_query_pending(&pending, true);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> + if (migration_iteration_next_ready(s, &pending)) {
> + migration_iteration_go_next(&pending);
> }
>
> /* Should we switch to postcopy now? */
> @@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * (2) Pending size is no more than the threshold specified
> * (which was calculated from expected downtime)
> */
> - complete_ready = can_switchover && (pending_size <= s->threshold_size);
> + complete_ready = can_switchover &&
> + (pending.total_bytes <= s->threshold_size);
shouldn't the condition that triggers postcopy migration also be updated?
As total_bytes is calculated as the sum of all three
(precopy_bytes + stopcopy_bytes + postcopy_bytes), this implies to me
that stopcopy_bytes is not a subset of precopy_bytes and would also need
to be migrated during switchover, before postcopy.
Once this is resolved, my Reviewed-by tag remains valid; the patch looks
good to me otherwise.
Thanks!
> }
>
> if (complete_ready) {
> - trace_migration_thread_low_pending(pending_size);
> + trace_migration_thread_low_pending(pending.total_bytes);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 397f602257..b75c311a95 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> {
> SaveStateEntry *se;
>
> - pending->precopy_bytes = 0;
> - pending->postcopy_bytes = 0;
> + memset(pending, 0, sizeof(*pending));
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (!se->ops || !se->ops->save_query_pending) {
> @@ -1779,7 +1778,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> se->ops->save_query_pending(se->opaque, pending, exact);
> }
>
> + pending->total_bytes = pending->precopy_bytes +
> + pending->stopcopy_bytes + pending->postcopy_bytes;
> +
> trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
> + pending->stopcopy_bytes,
> pending->postcopy_bytes);
> }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index f8995b8d0d..2f86ad448e 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
> loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-09 17:36 ` Juraj Marcin
@ 2026-04-16 17:20 ` Peter Xu
2026-04-17 10:18 ` Juraj Marcin
0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-16 17:20 UTC (permalink / raw)
To: Juraj Marcin, Avihai Horon
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On Thu, Apr 09, 2026 at 07:36:51PM +0200, Juraj Marcin wrote:
> Hi Peter,
>
> actually, I do have one question, see inline
[...]
> shouldn't also the condition that triggers postcopy migration be updated?
> As total_bytes is calculated as sum of all three
> (precopy_bytes + stopcopy_bytes + postcopy_bytes), this implies to me
> that stopcopy_bytes is not subset of precopy_bytes and would also need
> to be migrated during switchover before postcopy.
For now it shouldn't matter, since VFIO doesn't work with postcopy yet,
but it's a good point. We'd better get it right from the start.
While looking at this, I also found we may be reporting the wrong thing
in the query results when pmem is present and postcopy is involved; it's
about when this hits:
static bool ram_has_postcopy(void *opaque)
{
RAMBlock *rb;
RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
if (ram_block_is_pmem(rb)) {
info_report("Block: %s, host: %p is a nvdimm memory, postcopy"
"is not supported now!", rb->idstr, rb->host);
return false;
}
}
return migrate_postcopy_ram();
}
So I think we should also report differently based on whether pmem is
present in the ramblocks. IOW, a module should make sure its
save_query_pending() matches its has_postcopy() when both are present.
Or maybe we don't even need has_postcopy() at all.
If it's a problem, it's an old one. Let me address the comments so far on
this patch with a fixup planned to be squashed (I also added the trace
parameter Avihai requested); feel free to comment before I repost, thanks.
From 594b85b66b2d1abd9a38fae4051e01ffc73aa8ff Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 16 Apr 2026 13:09:16 -0400
Subject: [PATCH] fixup! migration: Introduce stopcopy_bytes in
save_query_pending()
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 13 +++++++++++--
migration/savevm.c | 3 ++-
migration/trace-events | 2 +-
3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index c2aa145106..62299ff3c0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3276,6 +3276,16 @@ static void migration_iteration_go_next(MigPendingData *pending)
}
}
+static bool postcopy_should_start(MigrationState *s, MigPendingData *pending)
+{
+ /* If postcopy's switchover will violate the user-specified downtime, stop */
+ if (pending->precopy_bytes + pending->stopcopy_bytes > s->threshold_size) {
+ return false;
+ }
+
+ return qatomic_read(&s->start_postcopy);
+}
+
/*
* Return true if continue to the next iteration directly, false
* otherwise.
@@ -3323,8 +3333,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
}
/* Should we switch to postcopy now? */
- if (pending.precopy_bytes <= s->threshold_size &&
- can_switchover && qatomic_read(&s->start_postcopy)) {
+ if (can_switchover && postcopy_should_start(s, &pending)) {
if (postcopy_start(s, &local_err)) {
migrate_error_propagate(s, error_copy(local_err));
error_report_err(local_err);
diff --git a/migration/savevm.c b/migration/savevm.c
index 1d3fce45b9..7f38be0ee1 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1804,7 +1804,8 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
pending->stopcopy_bytes,
- pending->postcopy_bytes);
+ pending->postcopy_bytes,
+ pending->total_bytes);
}
void qemu_savevm_state_cleanup(MigrationState *s)
diff --git a/migration/trace-events b/migration/trace-events
index 2f86ad448e..d2134af862 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
-qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
+qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
--
2.53.0
--
Peter Xu
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-16 17:20 ` Peter Xu
@ 2026-04-17 10:18 ` Juraj Marcin
0 siblings, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-17 10:18 UTC (permalink / raw)
To: Peter Xu
Cc: Avihai Horon, qemu-devel, Maciej S . Szmigiero,
Daniel P . Berrangé, Zhiyi Guo, Prasad Pandit,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
On 2026-04-16 13:20, Peter Xu wrote:
> On Thu, Apr 09, 2026 at 07:36:51PM +0200, Juraj Marcin wrote:
> > Hi Peter,
> >
> > actually, I do have one question, see inline
>
> [...]
>
> > shouldn't also the condition that triggers postcopy migration be updated?
> > As total_bytes is calculated as sum of all three
> > (precopy_bytes + stopcopy_bytes + postcopy_bytes), this implies to me
> > that stopcopy_bytes is not subset of precopy_bytes and would also need
> > to be migrated during switchover before postcopy.
>
> For now it shouldn't matter when VFIO never works with postcpoy yet, but
> it's a good point. We'd better make it right from the start.
>
> When looking at this, I also found we may be reporting wrong things in the
> query results when pmem is available on postcopy bits, it's about when this
> hits:
>
> static bool ram_has_postcopy(void *opaque)
> {
> RAMBlock *rb;
> RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
> if (ram_block_is_pmem(rb)) {
> info_report("Block: %s, host: %p is a nvdimm memory, postcopy"
> "is not supported now!", rb->idstr, rb->host);
> return false;
> }
> }
>
> return migrate_postcopy_ram();
> }
>
> So I think we should also report differently based on whether pmem is
> present in ramblocks.. IOW, I think the module should make sure its
> save_query_pending() to match its has_postcopy() when they're both present.
> Or, maybe we don't even need has_postcopy()..
Yeah, it looks like ram_state_pending() should use ram_has_postcopy()
instead of just migrate_postcopy_ram().
>
> If it's a problem, it should be an old problem. Let me address the
> comments so far on this patch, so a fixup planned to be squashed (I also
> added the trace parameter Avihai requested), feel free to comment before I
> repost, thanks.
>
> From 594b85b66b2d1abd9a38fae4051e01ffc73aa8ff Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 16 Apr 2026 13:09:16 -0400
> Subject: [PATCH] fixup! migration: Introduce stopcopy_bytes in
> save_query_pending()
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.c | 13 +++++++++++--
> migration/savevm.c | 3 ++-
> migration/trace-events | 2 +-
> 3 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index c2aa145106..62299ff3c0 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3276,6 +3276,16 @@ static void migration_iteration_go_next(MigPendingData *pending)
> }
> }
>
> +static bool postcopy_should_start(MigrationState *s, MigPendingData *pending)
> +{
> + /* If postcopy's switchover will violate the user-specified downtime, stop */
> + if (pending->precopy_bytes + pending->stopcopy_bytes > s->threshold_size) {
> + return false;
> + }
> +
> + return qatomic_read(&s->start_postcopy);
> +}
> +
> /*
> * Return true if continue to the next iteration directly, false
> * otherwise.
> @@ -3323,8 +3333,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> }
>
> /* Should we switch to postcopy now? */
> - if (pending.precopy_bytes <= s->threshold_size &&
> - can_switchover && qatomic_read(&s->start_postcopy)) {
> + if (can_switchover && postcopy_should_start(s, &pending)) {
> if (postcopy_start(s, &local_err)) {
> migrate_error_propagate(s, error_copy(local_err));
> error_report_err(local_err);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 1d3fce45b9..7f38be0ee1 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1804,7 +1804,8 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
>
> trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
> pending->stopcopy_bytes,
> - pending->postcopy_bytes);
> + pending->postcopy_bytes,
> + pending->total_bytes);
> }
>
> void qemu_savevm_state_cleanup(MigrationState *s)
> diff --git a/migration/trace-events b/migration/trace-events
> index 2f86ad448e..d2134af862 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy, uint64_t total) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64", total=%"PRIu64
> loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> --
> 2.53.0
The fixup looks good, thanks!
>
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-08 16:55 ` [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
2026-04-09 17:13 ` Juraj Marcin
2026-04-09 17:36 ` Juraj Marcin
@ 2026-04-13 10:34 ` Avihai Horon
2 siblings, 0 replies; 52+ messages in thread
From: Avihai Horon @ 2026-04-13 10:34 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 4/8/2026 7:55 PM, Peter Xu wrote:
>
> Allow modules to report data that can only be migrated after VM is stopped.
>
> When this concept is introduced, we will need to account stopcopy size to
> be part of pending_size as before.
>
> However, when there're data only can be migrated in stopcopy phase, it
> means the old "pending_size" may not always be able to reach low enough to
> kickoff an slow version of query sync.
>
> It used to be almost guaranteed to happen as all prior iterative modules
> doesn't have stopcopy only data. VFIO may change that fact by having some
> data that must be copied during stop phase.
>
> So we need to make sure QEMU will kickoff a synchronized version of query
> pending when all precopy data is migrated. This might be important to VFIO
> to keep making progress even if the downtime cannot yet be satisfied.
>
> So far, this patch should introduce no functional change, as no module yet
> report stopcopy size.
>
> This paves way for VFIO to properly report its pending data sizes, which
> will start to include stop-only data.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/register.h | 7 +++++
> migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> migration/savevm.c | 7 +++--
> migration/trace-events | 2 +-
> 4 files changed, 57 insertions(+), 11 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index aba3c9af2f..e822a2a59f 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -21,6 +21,13 @@ typedef struct MigPendingData {
> uint64_t precopy_bytes;
> /* Amount of pending bytes can be transferred in postcopy */
> uint64_t postcopy_bytes;
> + /* Amount of pending bytes can be transferred only in stopcopy */
> + uint64_t stopcopy_bytes;
> + /*
> + * Total pending data, modules do not need to update this field, it
> + * will be automatically calculated by migration core API.
> + */
> + uint64_t total_bytes;
> } MigPendingData;
>
> /**
> diff --git a/migration/migration.c b/migration/migration.c
> index 68cfe2d3bf..bb17bd0e68 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3198,6 +3198,44 @@ typedef enum {
> MIG_ITERATE_BREAK, /* Break the loop */
> } MigIterateState;
>
> +/* Are we ready to move to the next iteration phase? */
> +static bool migration_iteration_next_ready(MigrationState *s,
> + MigPendingData *pending)
> +{
> + /*
> + * If the estimated values already suggest us to switchover, mark this
> + * iteration finished, time to do a slow sync.
> + */
> + if (pending->total_bytes <= s->threshold_size) {
> + return true;
> + }
> +
> + /*
> + * Since we may have modules reporting stop-only data, we also want to
> + * re-query with slow mode if all precopy data is moved over. This
> + * will also mark the current iteration done.
> + *
> + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> + * size too large so it will never yet satisfy the downtime with the
> + * current setup (above check). Here, slow version of re-query helps
> + * because we keep trying the best to move whatever we have.
> + */
> + if (pending->precopy_bytes == 0) {
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void migration_iteration_go_next(MigPendingData *pending)
> +{
> + /*
> + * Do a slow sync will achieve this. TODO: move RAM iteration code
> + * into the core layer.
> + */
> + qemu_savevm_query_pending(pending, true);
> +}
> +
> /*
> * Return true if continue to the next iteration directly, false
> * otherwise.
> @@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> bool can_switchover = migration_can_switchover(s);
> MigPendingData pending = { };
> - uint64_t pending_size;
> bool complete_ready;
>
> /* Fast path - get the estimated amount of pending data */
> qemu_savevm_query_pending(&pending, false);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
> if (in_postcopy) {
> /*
> @@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy completion doesn't rely on can_switchover, because when
> * POSTCOPY_ACTIVE it means switchover already happened.
> */
> - complete_ready = !pending_size;
> + complete_ready = !pending.total_bytes;
> if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
> (s->postcopy_package_loaded || complete_ready)) {
> /*
> @@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy started, so ESTIMATE should always match with EXACT
> * during postcopy phase.
> */
> - if (pending_size <= s->threshold_size) {
> - qemu_savevm_query_pending(&pending, true);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> + if (migration_iteration_next_ready(s, &pending)) {
> + migration_iteration_go_next(&pending);
> }
>
> /* Should we switch to postcopy now? */
I agree with Juraj; the postcopy condition should also consider stopcopy_bytes.
> @@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * (2) Pending size is no more than the threshold specified
> * (which was calculated from expected downtime)
> */
> - complete_ready = can_switchover && (pending_size <= s->threshold_size);
> + complete_ready = can_switchover &&
> + (pending.total_bytes <= s->threshold_size);
> }
>
> if (complete_ready) {
> - trace_migration_thread_low_pending(pending_size);
> + trace_migration_thread_low_pending(pending.total_bytes);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 397f602257..b75c311a95 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> {
> SaveStateEntry *se;
>
> - pending->precopy_bytes = 0;
> - pending->postcopy_bytes = 0;
> + memset(pending, 0, sizeof(*pending));
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (!se->ops || !se->ops->save_query_pending) {
> @@ -1779,7 +1778,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> se->ops->save_query_pending(se->opaque, pending, exact);
> }
>
> + pending->total_bytes = pending->precopy_bytes +
> + pending->stopcopy_bytes + pending->postcopy_bytes;
> +
> trace_qemu_savevm_query_pending(exact, pending->precopy_bytes,
> + pending->stopcopy_bytes,
> pending->postcopy_bytes);
Would it be convenient to also log pending->total_bytes?
Other than that:
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
> }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index f8995b8d0d..2f86ad448e 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool exact, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "exact=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
> loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH 07/14] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (5 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 06/14] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-13 10:56 ` Avihai Horon
2026-04-08 16:55 ` [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime Peter Xu
` (8 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
VFIO reports different things in the fast and slow versions of its
pending query results. That was because it wanted to make sure precopy
data could reach 0, which is needed to make sure sync queries will
happen periodically over time.
Now with the stopcopy size reporting facility it doesn't need this hack
anymore. Fix this by reporting the same values in both the fast and
slow versions of the pending query request, except that the slow
version will also do a slow sync with the hardware.
While at it, remove the special casing for vfio_device_state_is_precopy(),
which may report nothing in a fast query. Then the reporting will be
consistent with VFIO devices that do not support the precopy phase.
Copying stable might be too much; just skip it and skip the Fixes tag.
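The split applied in the diff below can be sketched as a small
standalone model. Names here are illustrative, not QEMU's actual API;
the clamping assumes the device's reported stop-copy size may overlap
with the precopy estimate:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the pending-data split: the device's stop-copy size
 * covers data that the precopy phase can also transfer, so only the
 * excess over the precopy estimate is accounted as stop-copy-only
 * bytes. */
typedef struct {
    uint64_t precopy_bytes;
    uint64_t stopcopy_bytes;
} PendingSketch;

static void vfio_account_pending(PendingSketch *p, uint64_t init_size,
                                 uint64_t dirty_size,
                                 uint64_t stopcopy_size)
{
    uint64_t precopy = init_size + dirty_size;

    p->precopy_bytes += precopy;
    /* Clamp at zero: stop-copy data fully covered by precopy. */
    p->stopcopy_bytes +=
        stopcopy_size > precopy ? stopcopy_size - precopy : 0;
}
```

With this, the fast and slow queries can report the same values; the
only difference is whether a hardware sync refreshed the sizes first.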
Cc: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/vfio/migration.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 1e999f0040..57e88c9dcf 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -587,19 +587,23 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending,
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- uint64_t remain;
+ uint64_t precopy_size, stopcopy_size;
if (exact) {
vfio_state_pending_sync(vbasedev);
- remain = migration->stopcopy_size;
+ }
+
+ precopy_size =
+ migration->precopy_init_size + migration->precopy_dirty_size;
+
+ if (migration->stopcopy_size > precopy_size) {
+ stopcopy_size = migration->stopcopy_size - precopy_size;
} else {
- if (!vfio_device_state_is_precopy(vbasedev)) {
- return;
- }
- remain = migration->precopy_init_size + migration->precopy_dirty_size;
+ stopcopy_size = 0;
}
- pending->precopy_bytes += remain;
+ pending->precopy_bytes += precopy_size;
+ pending->stopcopy_bytes += stopcopy_size;
trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
migration->precopy_init_size,
--
2.53.0
^ permalink raw reply related	[flat|nested] 52+ messages in thread
* Re: [PATCH 07/14] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-04-08 16:55 ` [PATCH 07/14] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
@ 2026-04-13 10:56 ` Avihai Horon
0 siblings, 0 replies; 52+ messages in thread
From: Avihai Horon @ 2026-04-13 10:56 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 4/8/2026 7:55 PM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> VFIO reports different things in its fast/slow version of query pending
> results. It was because it wants to make sure precopy data can reach 0,
> which is needed to make sure sync queries will happen periodically over
> time.
>
> Now with stopcopy size reporting facility it doesn't need this hack
> anymore. Fix this by reporting the same values in fast/slow versions of
> query pending request, except that the slow version will do a slow sync
> with the hardwares.
>
> When at it, removing the special casing for vfio_device_state_is_precopy()
> which may reporting nothing in a fast query. Then ther reporting will be
> consistent to VFIO devices that do not support precopy phase.
>
> Copy stable might be too much; just skip it and skip the Fixes.
>
> Cc: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/vfio/migration.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
Reviewed-by: Avihai Horon <avihaih@nvidia.com>
I also tested it with mlx5 VFIO device, similar to your test scenario
(and some other variations of this test):
mlx5 VFIO device with 4GB stopcopy size, configured
avail-switchover-bandwidth, downtime_limit too low, increased
downtime_limit, migration converged successfully -- exp_down/Remaining
values seem correct.
So feel free to also add:
Tested-by: Avihai Horon <avihaih@nvidia.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (6 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 07/14] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:15 ` Juraj Marcin
2026-04-08 16:55 ` [PATCH 09/14] migration: Move iteration counter out of RAM Peter Xu
` (7 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
After qemu_savevm_query_pending() is exposed to more code paths, it can
be used at a very early stage when migration starts, and this may
expose some race conditions that we didn't use to have. This patch
prepares it for such use cases so the API is safe to use almost anytime.
What matters here is that querying pending data for each module
normally depends on save_setup() being run first; otherwise modules may
not be ready for the query request.
Consider an early cancellation of migration after SETUP status but
before invocation of the save_setup() hooks: source QEMU may fall into
the CANCELLING stage directly from SETUP (not ACTIVE, which is the
normal case), in which case save_setup() may not have been invoked and
modules are not ready. However, qemu_savevm_query_pending() may still
be used in QMP commands like query-migrate, causing crashes.
Guard such use cases by introducing a boolean reflecting the
availability of the vmstate save handlers upon correct completion of
the save_setup() hooks. So far, only qemu_savevm_query_pending() is
protected with it. Logically other hooks face a similar concern, but
most of them shouldn't be reachable from random code paths other than
the migration thread, so they should be fine.
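A minimal sketch of this guard pattern, using illustrative names rather
than QEMU's actual API (QEMU uses its own qatomic_* wrappers; plain C11
atomics are used here for the release/acquire pairing):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Setup publishes readiness with a release store; the query checks it
 * with an acquire load so an early caller reports all zeros instead of
 * touching handlers that were never set up. */
static atomic_bool save_setup_ready;

static void save_setup_done(void)
{
    /* Pairs with the acquire load in query_pending_bytes(). */
    atomic_store_explicit(&save_setup_ready, true, memory_order_release);
}

static void save_cleanup_done(void)
{
    atomic_store_explicit(&save_setup_ready, false, memory_order_release);
}

/* Returns false (and zero pending bytes) when invoked too early. */
static bool query_pending_bytes(uint64_t *pending)
{
    *pending = 0;
    if (!atomic_load_explicit(&save_setup_ready, memory_order_acquire)) {
        return false;
    }
    /* ...walk the save handlers and accumulate into *pending... */
    *pending = 42;  /* stand-in for the real accumulated total */
    return true;
}
```

The release/acquire pair guarantees that once a QMP handler observes
the flag as true, all writes done by save_setup() are visible to it.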
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 8 ++++++++
migration/savevm.h | 2 +-
migration/migration.c | 2 +-
migration/savevm.c | 37 +++++++++++++++++++++++++++++++++----
4 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index b6888daced..e504df6915 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -522,6 +522,14 @@ struct MigrationState {
* anything as input.
*/
bool has_block_bitmap_mapping;
+
+ /*
+ * This boolean reflects if the vmstate handlers have been properly
+ * setup on source side. It is set after vmstate save_setup() hooks
+ * are successfully invoked, and cleared after save_cleanup()s. It
+ * reflects a general availability of vmstate hooks on the source side.
+ */
+ bool save_setup_ready;
};
void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
diff --git a/migration/savevm.h b/migration/savevm.h
index 96fdf96d4e..04ed09cec2 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -42,7 +42,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s);
void qemu_savevm_send_header(QEMUFile *f);
void qemu_savevm_state_header(QEMUFile *f);
int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
-void qemu_savevm_state_cleanup(void);
+void qemu_savevm_state_cleanup(MigrationState *s);
void qemu_savevm_state_complete_postcopy(QEMUFile *f);
int qemu_savevm_state_complete_precopy(MigrationState *s);
void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
diff --git a/migration/migration.c b/migration/migration.c
index bb17bd0e68..a9ee3360e1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1283,7 +1283,7 @@ static void migration_cleanup(MigrationState *s)
g_free(s->hostname);
s->hostname = NULL;
- qemu_savevm_state_cleanup();
+ qemu_savevm_state_cleanup(s);
cpr_state_close();
cpr_transfer_source_destroy(s);
diff --git a/migration/savevm.c b/migration/savevm.c
index b75c311a95..1d3fce45b9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1387,7 +1387,8 @@ int qemu_savevm_state_non_iterable_early(QEMUFile *f,
return 0;
}
-static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
+static int qemu_savevm_state_setup(MigrationState *s, QEMUFile *f,
+ Error **errp)
{
SaveStateEntry *se;
int ret;
@@ -1409,6 +1410,13 @@ static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
}
}
+ /*
+ * Logically, it should be paired with any hook being used who needs to
+ * load_acquire() the flag first. So far, only save_query_pending()
+ * uses it.
+ */
+ qatomic_store_release(&s->save_setup_ready, true);
+
return 0;
}
@@ -1429,7 +1437,7 @@ int qemu_savevm_state_do_setup(QEMUFile *f, Error **errp)
return ret;
}
- ret = qemu_savevm_state_setup(f, errp);
+ ret = qemu_savevm_state_setup(ms, f, errp);
if (ret) {
return ret;
}
@@ -1764,10 +1772,23 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
{
+ MigrationState *s = migrate_get_current();
SaveStateEntry *se;
memset(pending, 0, sizeof(*pending));
+ /*
+ * This API can be invoked very early before SETUP is properly done, in
+ * that case don't invoke module queries because they're not ready.
+ * Just report all zeros.
+ *
+ * This is paired with save_setup_ready updates on save_setup() and
+ * save_cleanup().
+ */
+ if (!s || !qatomic_load_acquire(&s->save_setup_ready)) {
+ return;
+ }
+
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (!se->ops || !se->ops->save_query_pending) {
continue;
@@ -1786,7 +1807,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
pending->postcopy_bytes);
}
-void qemu_savevm_state_cleanup(void)
+void qemu_savevm_state_cleanup(MigrationState *s)
{
SaveStateEntry *se;
Error *local_err = NULL;
@@ -1795,6 +1816,14 @@ void qemu_savevm_state_cleanup(void)
error_report_err(local_err);
}
+ s->save_setup_ready = false;
+ /*
+ * Make sure we clear the flag before invoking save_cleanup(), so any
+ * racy QMP query-migrate won't try to invoke any save hooks. Just use
+ * an explicit barrier to be simple.
+ */
+ smp_mb();
+
trace_savevm_state_cleanup();
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
if (se->ops && se->ops->save_cleanup) {
@@ -1841,7 +1870,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
error_setg_errno(errp, -ret, "Error while writing VM state");
}
cleanup:
- qemu_savevm_state_cleanup();
+ qemu_savevm_state_cleanup(ms);
if (ret != 0) {
status = MIGRATION_STATUS_FAILED;
--
2.53.0
^ permalink raw reply related	[flat|nested] 52+ messages in thread
* Re: [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime
2026-04-08 16:55 ` [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime Peter Xu
@ 2026-04-09 17:15 ` Juraj Marcin
2026-04-16 18:06 ` Peter Xu
0 siblings, 1 reply; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:15 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 2026-04-08 12:55, Peter Xu wrote:
> After qemu_savevm_query_pending() be exposed to more code paths, it can be
> used at very early stage when migration started and this may expose some
> race conditions that we don't use to have. This patch make it prepared
> for such use cases so this API is fine to be used almost anytime.
>
> What matters here is, querying pending for each module normally depends on
> save_setup() being run first, otherwise modules may not be ready for the
> query request.
>
> Consider an early cancellation of migration after SETUP status but before
> invocations of save_setup() hooks, source QEMU may fall into CANCELLING
> stage directly from SETUP (not ACTIVE, which is the normal use case), in
> which case save_setup() may not have been invoked and modules are not
> ready. However qemu_savevm_query_pending() may still be used in QMP
> commands like query-migrate and causing crashes.
>
> Guard such use case by introducing a boolean reflecting the availability of
> vmstate save handlers on correct completions of save_setup()s. So far,
> only protect qemu_savevm_query_pending() with it. Logically other hooks
> face similar concern, but most of them shouldn't be reachable from random
> code path except migration thread so it should be fine.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.h | 8 ++++++++
> migration/savevm.h | 2 +-
> migration/migration.c | 2 +-
> migration/savevm.c | 37 +++++++++++++++++++++++++++++++++----
> 4 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/migration/migration.h b/migration/migration.h
> index b6888daced..e504df6915 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -522,6 +522,14 @@ struct MigrationState {
> * anything as input.
> */
> bool has_block_bitmap_mapping;
> +
> + /*
> + * This boolean reflects if the vmstate handlers have been properly
> + * setup on source side. It is set after vmstate save_setup() hooks
> + * are successfully invoked, and cleared after save_cleanup()s. It
> + * reflects a general availability of vmstate hooks on the source side.
> + */
> + bool save_setup_ready;
> };
>
> void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 96fdf96d4e..04ed09cec2 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -42,7 +42,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s);
> void qemu_savevm_send_header(QEMUFile *f);
> void qemu_savevm_state_header(QEMUFile *f);
> int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> -void qemu_savevm_state_cleanup(void);
> +void qemu_savevm_state_cleanup(MigrationState *s);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
> diff --git a/migration/migration.c b/migration/migration.c
> index bb17bd0e68..a9ee3360e1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1283,7 +1283,7 @@ static void migration_cleanup(MigrationState *s)
> g_free(s->hostname);
> s->hostname = NULL;
>
> - qemu_savevm_state_cleanup();
> + qemu_savevm_state_cleanup(s);
> cpr_state_close();
> cpr_transfer_source_destroy(s);
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b75c311a95..1d3fce45b9 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1387,7 +1387,8 @@ int qemu_savevm_state_non_iterable_early(QEMUFile *f,
> return 0;
> }
>
> -static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
> +static int qemu_savevm_state_setup(MigrationState *s, QEMUFile *f,
> + Error **errp)
> {
> SaveStateEntry *se;
> int ret;
> @@ -1409,6 +1410,13 @@ static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
> }
> }
>
> + /*
> + * Logically, it should be paired with any hook being used who needs to
> + * load_acquire() the flag first. So far, only save_query_pending()
> + * uses it.
> + */
> + qatomic_store_release(&s->save_setup_ready, true);
What other savevm functions would benefit from this? Would it make sense
to include them in this patch/series?
> +
> return 0;
> }
>
> @@ -1429,7 +1437,7 @@ int qemu_savevm_state_do_setup(QEMUFile *f, Error **errp)
> return ret;
> }
>
> - ret = qemu_savevm_state_setup(f, errp);
> + ret = qemu_savevm_state_setup(ms, f, errp);
> if (ret) {
> return ret;
> }
> @@ -1764,10 +1772,23 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
>
> void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> {
> + MigrationState *s = migrate_get_current();
> SaveStateEntry *se;
>
> memset(pending, 0, sizeof(*pending));
>
> + /*
> + * This API can be invoked very early before SETUP is properly done, in
> + * that case don't invoke module queries because they're not ready.
> + * Just report all zeros.
> + *
> + * This is paired with save_setup_ready updates on save_setup() and
> + * save_cleanup().
> + */
> + if (!s || !qatomic_load_acquire(&s->save_setup_ready)) {
> + return;
> + }
> +
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (!se->ops || !se->ops->save_query_pending) {
> continue;
> @@ -1786,7 +1807,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool exact)
> pending->postcopy_bytes);
> }
>
> -void qemu_savevm_state_cleanup(void)
> +void qemu_savevm_state_cleanup(MigrationState *s)
> {
> SaveStateEntry *se;
> Error *local_err = NULL;
> @@ -1795,6 +1816,14 @@ void qemu_savevm_state_cleanup(void)
> error_report_err(local_err);
> }
>
> + s->save_setup_ready = false;
> + /*
> + * Make sure we clear the flag before invoking save_cleanup(), so any
> + * racy QMP query-migrate won't try to invoke any save hooks. Just use
> + * an explicit barrier to be simple.
> + */
> + smp_mb();
> +
> trace_savevm_state_cleanup();
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> if (se->ops && se->ops->save_cleanup) {
> @@ -1841,7 +1870,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
> error_setg_errno(errp, -ret, "Error while writing VM state");
> }
> cleanup:
> - qemu_savevm_state_cleanup();
> + qemu_savevm_state_cleanup(ms);
>
> if (ret != 0) {
> status = MIGRATION_STATUS_FAILED;
> --
> 2.53.0
>
^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime
2026-04-09 17:15 ` Juraj Marcin
@ 2026-04-16 18:06 ` Peter Xu
2026-04-17 10:26 ` Juraj Marcin
0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-16 18:06 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On Thu, Apr 09, 2026 at 07:15:28PM +0200, Juraj Marcin wrote:
> On 2026-04-08 12:55, Peter Xu wrote:
> > After qemu_savevm_query_pending() be exposed to more code paths, it can be
> > used at very early stage when migration started and this may expose some
> > race conditions that we don't use to have. This patch make it prepared
> > for such use cases so this API is fine to be used almost anytime.
> >
> > What matters here is, querying pending for each module normally depends on
> > save_setup() being run first, otherwise modules may not be ready for the
> > query request.
> >
> > Consider an early cancellation of migration after SETUP status but before
> > invocations of save_setup() hooks, source QEMU may fall into CANCELLING
> > stage directly from SETUP (not ACTIVE, which is the normal use case), in
> > which case save_setup() may not have been invoked and modules are not
> > ready. However qemu_savevm_query_pending() may still be used in QMP
> > commands like query-migrate and causing crashes.
> >
> > Guard such use case by introducing a boolean reflecting the availability of
> > vmstate save handlers on correct completions of save_setup()s. So far,
> > only protect qemu_savevm_query_pending() with it. Logically other hooks
> > face similar concern, but most of them shouldn't be reachable from random
> > code path except migration thread so it should be fine.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/migration.h | 8 ++++++++
> > migration/savevm.h | 2 +-
> > migration/migration.c | 2 +-
> > migration/savevm.c | 37 +++++++++++++++++++++++++++++++++----
> > 4 files changed, 43 insertions(+), 6 deletions(-)
> >
> > diff --git a/migration/migration.h b/migration/migration.h
> > index b6888daced..e504df6915 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -522,6 +522,14 @@ struct MigrationState {
> > * anything as input.
> > */
> > bool has_block_bitmap_mapping;
> > +
> > + /*
> > + * This boolean reflects if the vmstate handlers have been properly
> > + * setup on source side. It is set after vmstate save_setup() hooks
> > + * are successfully invoked, and cleared after save_cleanup()s. It
> > + * reflects a general availability of vmstate hooks on the source side.
> > + */
> > + bool save_setup_ready;
> > };
> >
> > void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> > diff --git a/migration/savevm.h b/migration/savevm.h
> > index 96fdf96d4e..04ed09cec2 100644
> > --- a/migration/savevm.h
> > +++ b/migration/savevm.h
> > @@ -42,7 +42,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s);
> > void qemu_savevm_send_header(QEMUFile *f);
> > void qemu_savevm_state_header(QEMUFile *f);
> > int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> > -void qemu_savevm_state_cleanup(void);
> > +void qemu_savevm_state_cleanup(MigrationState *s);
> > void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> > int qemu_savevm_state_complete_precopy(MigrationState *s);
> > void qemu_savevm_query_pending(MigPendingData *pending, bool exact);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index bb17bd0e68..a9ee3360e1 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1283,7 +1283,7 @@ static void migration_cleanup(MigrationState *s)
> > g_free(s->hostname);
> > s->hostname = NULL;
> >
> > - qemu_savevm_state_cleanup();
> > + qemu_savevm_state_cleanup(s);
> > cpr_state_close();
> > cpr_transfer_source_destroy(s);
> >
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index b75c311a95..1d3fce45b9 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1387,7 +1387,8 @@ int qemu_savevm_state_non_iterable_early(QEMUFile *f,
> > return 0;
> > }
> >
> > -static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
> > +static int qemu_savevm_state_setup(MigrationState *s, QEMUFile *f,
> > + Error **errp)
> > {
> > SaveStateEntry *se;
> > int ret;
> > @@ -1409,6 +1410,13 @@ static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
> > }
> > }
> >
> > + /*
> > + * Logically, it should be paired with any hook being used who needs to
> > + * load_acquire() the flag first. So far, only save_query_pending()
> > + * uses it.
> > + */
> > + qatomic_store_release(&s->save_setup_ready, true);
>
> What other savevm functions would benefit from this? Would it make sense
> to include them in this patch/series?
I don't yet know any other hook would need this.
AFAIU, the rationale is that most of the hooks should only be invoked
in migration_thread(), where it's guaranteed that save_setup() has
already been done beforehand.
Here, when we make save_query_pending() more flexible so it can be
invoked in QMP handlers, that opens up the races, and only that hook is
special.
We actually have another option, which is to keep another cached
variable for the global remaining data, then only update it in
migration_thread(). That would keep the rationale as before.
But then that var will only be used for this purpose, and I thought it
might also be confusing: remember that we have per-module counter caches
already for querying the remaining data fields, like what RAM and VFIO do.
So I chose the current approach. Let me know what you think.
Thanks,
--
Peter Xu
^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime
2026-04-16 18:06 ` Peter Xu
@ 2026-04-17 10:26 ` Juraj Marcin
2026-04-20 15:56 ` Peter Xu
0 siblings, 1 reply; 52+ messages in thread
From: Juraj Marcin @ 2026-04-17 10:26 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
Hi Peter,
On 2026-04-16 14:06, Peter Xu wrote:
> On Thu, Apr 09, 2026 at 07:15:28PM +0200, Juraj Marcin wrote:
> > On 2026-04-08 12:55, Peter Xu wrote:
...
> > > -static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
> > > +static int qemu_savevm_state_setup(MigrationState *s, QEMUFile *f,
> > > + Error **errp)
> > > {
> > > SaveStateEntry *se;
> > > int ret;
> > > @@ -1409,6 +1410,13 @@ static int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
> > > }
> > > }
> > >
> > > + /*
> > > + * Logically, it should be paired with any hook being used who needs to
> > > + * load_acquire() the flag first. So far, only save_query_pending()
> > > + * uses it.
> > > + */
> > > + qatomic_store_release(&s->save_setup_ready, true);
> >
> > What other savevm functions would benefit from this? Would it make sense
> > to include them in this patch/series?
>
> I don't yet know any other hook would need this.
>
> AFAIU, the rational is most of the hooks should only be invoked in
> migration_thread(), where it's always guaranteed that save_setup() is
> always done before hand.
>
> Here, when we make save_query_pending() to be more flexible to be able to
> get invoked in QMP handlers, then it opens the races and only that hook is
> special.
>
> We actually have another option, which is to keep another cached variable
> for global remaining data, then only update it in migration_thread(). Then
> it will keep the rational as before.
>
> But then that var will only be used for this purpose, and I thought it
> might also be confusing: remember that we have per-module counter caches
> already for querying the remaining data fields, like what RAM and VFIO do.
>
> So I chose the current approach. Let me know what you think.
I agree with your approach; I was just wondering if there are other
such races that are not handled yet. I don't think anything needs to be
changed then, given that only save_query_pending() is susceptible to the
race.
Thanks!
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime
2026-04-17 10:26 ` Juraj Marcin
@ 2026-04-20 15:56 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-20 15:56 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On Fri, Apr 17, 2026 at 12:26:29PM +0200, Juraj Marcin wrote:
> I agree with your approach, I was just wondering if there are other such
> races that are not handled yet. I don't think anything needs to be
> changed then, given that only save_query_pending() is susceptible to the
> race.
Emmm, irrelevant to what we're discussing... I overlooked something here.
I just noticed save_query_pending() has lock implications; it may not
work when exported directly to a QMP handler: dirty_bitmap_state_pending(),
as one example, may take the BQL inside the query function.
I added some lines to the current dirty-bitmap tests, and they can
already reproduce a crash (due to deadlocking on the BQL):
diff --git a/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
index c519e6db8c..67d69d9d1e 100755
--- a/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
+++ b/tests/qemu-iotests/tests/migrate-bitmaps-postcopy-test
@@ -162,8 +162,14 @@ class TestDirtyBitmapPostcopyMigration(iotests.QMPTestCase):
self.vm_a.cmd('migrate', uri='exec:cat>' + fifo)
+ # Verify query-migrate working with dirty-bitmaps in precopy mode
+ self.vm_a.qmp('query-migrate')
+
self.vm_a.cmd('migrate-start-postcopy')
+ # Verify query-migrate working with dirty-bitmaps in postcopy mode
+ self.vm_a.qmp('query-migrate')
+
event_resume = self.vm_b.event_wait('RESUME')
self.vm_b_events.append(event_resume)
return (event_resume, discards1_sha256, all_discards_sha256)
I'll need to fix it in another way; I'll think about it. When I repost,
I'll likely add the above as a separate small patch for qemu iotests to
cover this case.
--
Peter Xu
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 09/14] migration: Move iteration counter out of RAM
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (7 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 08/14] migration: Make qemu_savevm_query_pending() available anytime Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 22:14 ` Fabiano Rosas
2026-04-08 16:55 ` [PATCH 10/14] migration: Introduce a helper to return switchover bw estimate Peter Xu
` (6 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson, Hyman Huang,
Prasad Pandit
The iteration counter used to hide in the RAM dirty sync path. Now that
more modules can slow-sync dirty information, keeping it there is no longer
a good fit, because iterations are not a RAM-only concept: all modules
should follow them.
More importantly, mgmt may query dirty info (to make policy decisions like
adjusting downtime) by listening to iteration count changes via QMP events.
So we must make sure the bump of the iteration count only happens _after_
the dirty sync operations, in whatever form (RAM's dirty bitmap sync, or
VFIO's various ioctls to fetch the latest dirty info from the kernel).
Move this into the core migration path to manage, together with the event
generation, so that it can be properly ordered with the sync operations of
all modules.
This brings a good side effect: it fixes an old issue where
cpu_throttle_dirty_sync_timer_tick() could randomly bump the iteration
count (because it invokes sync ops). Now it won't, which is actually the
right behavior.
That said, we have code (not only in QEMU, but likely in mgmt too) assuming
the 1st iteration always reports a dirty sync count of 1. Initialize it to
1 now, because the counter bump no longer covers the dirty sync done during
setup().
Reviewed-by: Hyman Huang <yong.huang@smartx.com>
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration-stats.h | 3 ++-
migration/migration.c | 29 ++++++++++++++++++++++++++---
migration/ram.c | 6 ------
3 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 1153520f7a..326ddb0088 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -43,7 +43,8 @@ typedef struct {
*/
uint64_t dirty_pages_rate;
/*
- * Number of times we have synchronized guest bitmaps.
+ * Number of times we have synchronized guest bitmaps. This always
+ * starts from 1 for the 1st iteration.
*/
uint64_t dirty_sync_count;
/*
diff --git a/migration/migration.c b/migration/migration.c
index a9ee3360e1..57d91ad9e3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
s->threshold_size = 0;
s->switchover_acked = false;
s->rdma_migration = false;
+
/*
- * set mig_stats memory to zero for a new migration
+ * set mig_stats memory to zero for a new migration.. except the
+ * iteration counter, which we want to make sure it returns 1 for the
+ * first iteration.
*/
memset(&mig_stats, 0, sizeof(mig_stats));
+ mig_stats.dirty_sync_count = 1;
+
migration_reset_vfio_bytes_transferred();
s->postcopy_package_loaded = false;
@@ -3230,10 +3235,28 @@ static bool migration_iteration_next_ready(MigrationState *s,
static void migration_iteration_go_next(MigPendingData *pending)
{
/*
- * Do a slow sync will achieve this. TODO: move RAM iteration code
- * into the core layer.
+ * Do a slow sync first before boosting the iteration count.
*/
qemu_savevm_query_pending(pending, true);
+
+ /*
+ * Boost dirty sync count to reflect we finished one iteration.
+ *
+ * NOTE: we need to make sure when this happens (together with the
+ * event sent below) all modules have slow-synced the pending data
+ * above. That means a write mem barrier, but qatomic_add() should be
+ * enough.
+ *
+ * It's because a mgmt could wait on the iteration event to query again
+ * on pending data for policy changes (e.g. downtime adjustments). The
+ * ordering will make sure the query will fetch the latest results from
+ * all the modules.
+ */
+ qatomic_add(&mig_stats.dirty_sync_count, 1);
+
+ if (migrate_events()) {
+ qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
+ }
}
/*
diff --git a/migration/ram.c b/migration/ram.c
index e5b7217bf5..686162643d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
RAMBlock *block;
int64_t end_time;
- qatomic_add(&mig_stats.dirty_sync_count, 1);
-
if (!rs->time_last_bitmap_sync) {
rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
}
@@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
rs->num_dirty_pages_period = 0;
rs->bytes_xfer_prev = migration_transferred_bytes();
}
- if (migrate_events()) {
- uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
- qapi_event_send_migration_pass(generation);
- }
}
void migration_bitmap_sync_precopy(bool last_stage)
--
2.53.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 09/14] migration: Move iteration counter out of RAM
2026-04-08 16:55 ` [PATCH 09/14] migration: Move iteration counter out of RAM Peter Xu
@ 2026-04-09 22:14 ` Fabiano Rosas
2026-04-16 18:15 ` Peter Xu
0 siblings, 1 reply; 52+ messages in thread
From: Fabiano Rosas @ 2026-04-09 22:14 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Joao Martins,
Markus Armbruster, Alex Williamson, Hyman Huang, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> The iteration counter used to hide in the RAM dirty sync path. Now that
> more modules can slow-sync dirty information, keeping it there is no longer
> a good fit, because iterations are not a RAM-only concept: all modules
> should follow them.
>
> More importantly, mgmt may query dirty info (to make policy decisions like
> adjusting downtime) by listening to iteration count changes via QMP events.
> So we must make sure the bump of the iteration count only happens _after_
> the dirty sync operations, in whatever form (RAM's dirty bitmap sync, or
> VFIO's various ioctls to fetch the latest dirty info from the kernel).
>
> Move this into the core migration path to manage, together with the event
> generation, so that it can be properly ordered with the sync operations of
> all modules.
>
> This brings a good side effect: it fixes an old issue where
> cpu_throttle_dirty_sync_timer_tick() could randomly bump the iteration
> count (because it invokes sync ops). Now it won't, which is actually the
> right behavior.
>
Good.
> That said, we have code (not only in QEMU, but likely in mgmt too) assuming
> the 1st iteration always reports a dirty sync count of 1. Initialize it to
> 1 now, because the counter bump no longer covers the dirty sync done during
> setup().
>
> Reviewed-by: Hyman Huang <yong.huang@smartx.com>
> Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
> Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration-stats.h | 3 ++-
> migration/migration.c | 29 ++++++++++++++++++++++++++---
> migration/ram.c | 6 ------
> 3 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 1153520f7a..326ddb0088 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -43,7 +43,8 @@ typedef struct {
> */
> uint64_t dirty_pages_rate;
> /*
> - * Number of times we have synchronized guest bitmaps.
> + * Number of times we have synchronized guest bitmaps. This always
> + * starts from 1 for the 1st iteration.
> */
> uint64_t dirty_sync_count;
> /*
> diff --git a/migration/migration.c b/migration/migration.c
> index a9ee3360e1..57d91ad9e3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
> s->threshold_size = 0;
> s->switchover_acked = false;
> s->rdma_migration = false;
> +
> /*
> - * set mig_stats memory to zero for a new migration
> + * set mig_stats memory to zero for a new migration.. except the
> + * iteration counter, which we want to make sure it returns 1 for the
> + * first iteration.
> */
> memset(&mig_stats, 0, sizeof(mig_stats));
> + mig_stats.dirty_sync_count = 1;
> +
> migration_reset_vfio_bytes_transferred();
>
> s->postcopy_package_loaded = false;
> @@ -3230,10 +3235,28 @@ static bool migration_iteration_next_ready(MigrationState *s,
> static void migration_iteration_go_next(MigPendingData *pending)
> {
> /*
> - * Do a slow sync will achieve this. TODO: move RAM iteration code
> - * into the core layer.
> + * Do a slow sync first before boosting the iteration count.
> */
> qemu_savevm_query_pending(pending, true);
> +
> + /*
> + * Boost dirty sync count to reflect we finished one iteration.
> + *
> + * NOTE: we need to make sure when this happens (together with the
> + * event sent below) all modules have slow-synced the pending data
> + * above. That means a write mem barrier, but qatomic_add() should be
> + * enough.
> + *
> + * It's because a mgmt could wait on the iteration event to query again
> + * on pending data for policy changes (e.g. downtime adjustments). The
> + * ordering will make sure the query will fetch the latest results from
> + * all the modules.
> + */
> + qatomic_add(&mig_stats.dirty_sync_count, 1);
> +
> + if (migrate_events()) {
> + qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
> + }
> }
>
> /*
> diff --git a/migration/ram.c b/migration/ram.c
> index e5b7217bf5..686162643d 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
> RAMBlock *block;
> int64_t end_time;
>
> - qatomic_add(&mig_stats.dirty_sync_count, 1);
> -
> if (!rs->time_last_bitmap_sync) {
> rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
Tangent: aren't these updates to time_last_bitmap_sync racy with the
sync from the throttle code?
> }
> @@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
> rs->num_dirty_pages_period = 0;
> rs->bytes_xfer_prev = migration_transferred_bytes();
> }
> - if (migrate_events()) {
> - uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
> - qapi_event_send_migration_pass(generation);
> - }
> }
>
> void migration_bitmap_sync_precopy(bool last_stage)
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 09/14] migration: Move iteration counter out of RAM
2026-04-09 22:14 ` Fabiano Rosas
@ 2026-04-16 18:15 ` Peter Xu
2026-04-16 21:15 ` Fabiano Rosas
0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-16 18:15 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Joao Martins,
Markus Armbruster, Alex Williamson, Hyman Huang, Prasad Pandit
On Thu, Apr 09, 2026 at 07:14:15PM -0300, Fabiano Rosas wrote:
> > diff --git a/migration/ram.c b/migration/ram.c
> > index e5b7217bf5..686162643d 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
> > RAMBlock *block;
> > int64_t end_time;
> >
> > - qatomic_add(&mig_stats.dirty_sync_count, 1);
> > -
> > if (!rs->time_last_bitmap_sync) {
> > rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>
> Tangent: aren't these updates to time_last_bitmap_sync racy with the
> sync from the throttle code?
It looks fine to me. This variable (along with log sync) should require
the BQL, so I expect the normal forms of race condition won't happen.
But maybe you meant that migration_bitmap_sync() can also be invoked by the
timer kicked off in cpu_throttle_dirty_sync_timer_tick()?
Indeed that could happen together with the migration thread trying to sync
for other reasons, but so far I see no issues: time_last_bitmap_sync might
get updated sooner, but it should always reflect the time of whoever synced
last, and when both want to do it they contend on the BQL, which looks ok.
So the expectation is that the timer just makes sure sync happens slightly
more frequently, where the QEMU src previously might sync too few times on
huge VMs (because each iteration took too long).
--
Peter Xu
^ permalink raw reply [flat|nested] 52+ messages in thread* Re: [PATCH 09/14] migration: Move iteration counter out of RAM
2026-04-16 18:15 ` Peter Xu
@ 2026-04-16 21:15 ` Fabiano Rosas
0 siblings, 0 replies; 52+ messages in thread
From: Fabiano Rosas @ 2026-04-16 21:15 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Joao Martins,
Markus Armbruster, Alex Williamson, Hyman Huang, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> On Thu, Apr 09, 2026 at 07:14:15PM -0300, Fabiano Rosas wrote:
>> > diff --git a/migration/ram.c b/migration/ram.c
>> > index e5b7217bf5..686162643d 100644
>> > --- a/migration/ram.c
>> > +++ b/migration/ram.c
>> > @@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
>> > RAMBlock *block;
>> > int64_t end_time;
>> >
>> > - qatomic_add(&mig_stats.dirty_sync_count, 1);
>> > -
>> > if (!rs->time_last_bitmap_sync) {
>> > rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>
>> Tangent: aren't these updates to time_last_bitmap_sync racy with the
>> sync from the throttle code?
>
> It looks fine to me. This variable (along with log sync) should require
> the BQL, so I expect the normal forms of race condition won't happen.
>
> But maybe you meant that migration_bitmap_sync() can also be invoked by the
> timer kicked off in cpu_throttle_dirty_sync_timer_tick()?
>
Yep.
> Indeed that could happen together with the migration thread trying to sync
> for other reasons, but so far I see no issues: time_last_bitmap_sync might
> get updated sooner, but it should always reflect the time of whoever synced
> last, and when both want to do it they contend on the BQL, which looks ok.
>
Hm, I think I need to do a deeper audit; there are several accesses to
RAMState there. But I agree that on the surface there's nothing too
egregious.
> So the expectation is that the timer just makes sure sync happens slightly
> more frequently, where the QEMU src previously might sync too few times on
> huge VMs (because each iteration took too long).
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH 10/14] migration: Introduce a helper to return switchover bw estimate
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (8 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 09/14] migration: Move iteration counter out of RAM Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-08 16:55 ` [PATCH 11/14] migration: Calculate expected downtime on demand Peter Xu
` (5 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
Add a helper migration_get_switchover_bw() to return an estimate of the
switchover bandwidth. Use it to simplify the current code.
This will be used later to remove expected_downtime.
While at it, remove two qatomic_read() calls to shrink the lines, because
atomic ops are not needed when it's always the same thread doing the
updates.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 48 +++++++++++++++++++++----------------------
1 file changed, 24 insertions(+), 24 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 57d91ad9e3..a1a02e3a9f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -984,6 +984,21 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
}
+/*
+ * Returns the estimated switchover bandwidth (unit: bytes / seconds)
+ */
+static double migration_get_switchover_bw(MigrationState *s)
+{
+ uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
+
+ if (switchover_bw) {
+ /* If user specified, prioritize this value and don't estimate */
+ return (double)switchover_bw;
+ }
+
+ return s->mbps / 8 * 1000 * 1000;
+}
+
bool migration_is_running(void)
{
MigrationState *s = current_migration;
@@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
{
uint64_t transferred, transferred_pages, time_spent;
uint64_t current_bytes; /* bytes transferred since the beginning */
- uint64_t switchover_bw;
- /* Expected bandwidth when switching over to destination QEMU */
- double expected_bw_per_ms;
- double bandwidth;
+ double switchover_bw_per_ms;
if (current_time < s->iteration_start_time + BUFFER_DELAY) {
return;
}
- switchover_bw = migrate_avail_switchover_bandwidth();
current_bytes = migration_transferred_bytes();
transferred = current_bytes - s->iteration_initial_bytes;
time_spent = current_time - s->iteration_start_time;
- bandwidth = (double)transferred / time_spent;
-
- if (switchover_bw) {
- /*
- * If the user specified a switchover bandwidth, let's trust the
- * user so that can be more accurate than what we estimated.
- */
- expected_bw_per_ms = (double)switchover_bw / 1000;
- } else {
- /* If the user doesn't specify bandwidth, we use the estimated */
- expected_bw_per_ms = bandwidth;
- }
-
- s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
-
s->mbps = (((double) transferred * 8.0) /
((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
+ /* NOTE: only update this after bandwidth (s->mbps) updated */
+ switchover_bw_per_ms = migration_get_switchover_bw(s) / 1000;
+ s->threshold_size = switchover_bw_per_ms * migrate_downtime_limit();
+
transferred_pages = ram_get_total_transferred_pages() -
s->iteration_initial_pages;
s->pages_per_second = (double) transferred_pages /
@@ -3166,10 +3166,9 @@ static void migration_update_counters(MigrationState *s,
* if we haven't sent anything, we don't want to
* recalculate. 10000 is a small enough number for our purposes
*/
- if (qatomic_read(&mig_stats.dirty_pages_rate) &&
- transferred > 10000) {
+ if (mig_stats.dirty_pages_rate && transferred > 10000) {
s->expected_downtime =
- qatomic_read(&mig_stats.dirty_bytes_last_sync) / expected_bw_per_ms;
+ mig_stats.dirty_bytes_last_sync / switchover_bw_per_ms;
}
migration_rate_reset();
@@ -3178,7 +3177,8 @@ static void migration_update_counters(MigrationState *s,
trace_migrate_transferred(transferred, time_spent,
/* Both in unit bytes/ms */
- bandwidth, switchover_bw / 1000,
+ (uint64_t)s->mbps,
+ (uint64_t)switchover_bw_per_ms,
s->threshold_size);
}
--
2.53.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 11/14] migration: Calculate expected downtime on demand
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (9 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 10/14] migration: Introduce a helper to return switchover bw estimate Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:16 ` Juraj Marcin
2026-04-08 16:55 ` [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
` (4 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
This value does not need to be calculated that frequently. Only calculate
it on demand when query-migrate happens. With that, we can remove the
variable from MigrationState.
This paves the way for fixing this value to include all modules (not only
RAM but others too).
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 2 +-
migration/migration.c | 25 ++++++++++++-------------
2 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index e504df6915..32543800b6 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -359,7 +359,6 @@ struct MigrationState {
/* Timestamp when VM is down (ms) to migrate the last stuff */
int64_t downtime_start;
int64_t downtime;
- int64_t expected_downtime;
bool capabilities[MIGRATION_CAPABILITY__MAX];
int64_t setup_time;
@@ -594,6 +593,7 @@ void migration_cancel(void);
void migration_populate_vfio_info(MigrationInfo *info);
void migration_reset_vfio_bytes_transferred(void);
void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
+int64_t migration_downtime_calc_expected(MigrationState *s);
/*
* Migration thread waiting for return path thread. Return non-zero if an
diff --git a/migration/migration.c b/migration/migration.c
index a1a02e3a9f..f12cd9efd3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1041,6 +1041,17 @@ static bool migrate_show_downtime(MigrationState *s)
return (s->state == MIGRATION_STATUS_COMPLETED) || migration_in_postcopy();
}
+/* Return expected downtime (unit: milliseconds) */
+int64_t migration_downtime_calc_expected(MigrationState *s)
+{
+ if (mig_stats.dirty_sync_count <= 1) {
+ return migrate_downtime_limit();
+ }
+
+ return mig_stats.dirty_bytes_last_sync /
+ migration_get_switchover_bw(s) * 1000;
+}
+
static void populate_time_info(MigrationInfo *info, MigrationState *s)
{
info->has_status = true;
@@ -1061,7 +1072,7 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
info->downtime = s->downtime;
} else {
info->has_expected_downtime = true;
- info->expected_downtime = s->expected_downtime;
+ info->expected_downtime = migration_downtime_calc_expected(s);
}
}
@@ -1649,7 +1660,6 @@ int migrate_init(MigrationState *s, Error **errp)
s->mbps = 0.0;
s->pages_per_second = 0.0;
s->downtime = 0;
- s->expected_downtime = 0;
s->setup_time = 0;
s->start_postcopy = false;
s->migration_thread_running = false;
@@ -3162,15 +3172,6 @@ static void migration_update_counters(MigrationState *s,
s->pages_per_second = (double) transferred_pages /
(((double) time_spent / 1000.0));
- /*
- * if we haven't sent anything, we don't want to
- * recalculate. 10000 is a small enough number for our purposes
- */
- if (mig_stats.dirty_pages_rate && transferred > 10000) {
- s->expected_downtime =
- mig_stats.dirty_bytes_last_sync / switchover_bw_per_ms;
- }
-
migration_rate_reset();
update_iteration_initial_status(s);
@@ -3816,8 +3817,6 @@ void migration_start_outgoing(MigrationState *s)
bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
int ret;
- s->expected_downtime = migrate_downtime_limit();
-
if (resume) {
/* This is a resumed migration */
rate_limit = migrate_max_postcopy_bandwidth();
--
2.53.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 11/14] migration: Calculate expected downtime on demand
2026-04-08 16:55 ` [PATCH 11/14] migration: Calculate expected downtime on demand Peter Xu
@ 2026-04-09 17:16 ` Juraj Marcin
0 siblings, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:16 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 2026-04-08 12:55, Peter Xu wrote:
> This value does not need to be calculated that frequently. Only calculate
> it on demand when query-migrate happens. With that, we can remove the
> variable from MigrationState.
>
> This paves the way for fixing this value to include all modules (not only
> RAM but others too).
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.h | 2 +-
> migration/migration.c | 25 ++++++++++++-------------
> 2 files changed, 13 insertions(+), 14 deletions(-)
>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (10 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 11/14] migration: Calculate expected downtime on demand Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:17 ` Juraj Marcin
2026-04-09 22:17 ` Fabiano Rosas
2026-04-08 16:55 ` [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports Peter Xu
` (3 subsequent siblings)
15 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
QEMU provides an expected downtime for the whole system during migration,
by dividing the total dirty RAM recorded at the last sync by the estimated
switchover bandwidth.
That was flawed once VFIO is taken into account: consider a VFIO GPU device
that contains GBs of data to migrate during the stop phase. Those will not
be accounted for in this math.
Fix it by updating dirty_bytes_last_sync properly only when we go to the
next iteration, rather than hiding this update in the RAM code. Meanwhile,
fetch the total (rather than RAM-only) portion of dirty bytes, so as to
include GPU device states too.
Update the comment of the field to reflect its new meaning.
After this change, the expected-downtime read from query-migrate should be
very accurate even with VFIO devices involved.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration-stats.h | 8 +++-----
migration/migration.c | 11 ++++++++---
migration/ram.c | 1 -
3 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 326ddb0088..1447316802 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -31,11 +31,9 @@
*/
typedef struct {
/*
- * Number of bytes that were dirty last time that we synced with
- * the guest memory. We use that to calculate the downtime. As
- * the remaining dirty amounts to what we know that is still dirty
- * since last iteration, not counting what the guest has dirtied
- * since we synchronized bitmaps.
+ * Number of bytes that were reported dirty after the lastest
+ * system-wise synchronization on dirty information. It is used to
+ * do best-effort estimation on expected downtime.
*/
uint64_t dirty_bytes_last_sync;
/*
diff --git a/migration/migration.c b/migration/migration.c
index f12cd9efd3..4010e5dcf5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3240,18 +3240,23 @@ static void migration_iteration_go_next(MigPendingData *pending)
*/
qemu_savevm_query_pending(pending, true);
+ /*
+ * Update the dirty information for the whole system for this
+ * iteration. This value is used to calculate expected downtime.
+ */
+ qatomic_set(&mig_stats.dirty_bytes_last_sync, pending->total_bytes);
+
/*
* Boost dirty sync count to reflect we finished one iteration.
*
* NOTE: we need to make sure when this happens (together with the
* event sent below) all modules have slow-synced the pending data
- * above. That means a write mem barrier, but qatomic_add() should be
- * enough.
+ * above and updated corresponding fields (e.g. dirty_bytes_last_sync).
*
* It's because a mgmt could wait on the iteration event to query again
* on pending data for policy changes (e.g. downtime adjustments). The
* ordering will make sure the query will fetch the latest results from
- * all the modules.
+ * all the modules on everything.
*/
qatomic_add(&mig_stats.dirty_sync_count, 1);
diff --git a/migration/ram.c b/migration/ram.c
index 686162643d..d927ad7508 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1148,7 +1148,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
RAMBLOCK_FOREACH_NOT_IGNORED(block) {
ramblock_sync_dirty_bitmap(rs, block);
}
- qatomic_set(&mig_stats.dirty_bytes_last_sync, ram_bytes_remaining());
}
}
--
2.53.0
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info
2026-04-08 16:55 ` [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
@ 2026-04-09 17:17 ` Juraj Marcin
2026-04-09 22:17 ` Fabiano Rosas
1 sibling, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:17 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 2026-04-08 12:55, Peter Xu wrote:
> QEMU provides an expected downtime for the whole system during migration,
> by dividing the total dirty RAM recorded at the last sync by the estimated
> switchover bandwidth.
>
> That was flawed once VFIO is taken into account: consider a VFIO GPU device
> that contains GBs of data to migrate during the stop phase. Those will not
> be accounted for in this math.
>
> Fix it by updating dirty_bytes_last_sync properly only when we go to the
> next iteration, rather than hiding this update in the RAM code. Meanwhile,
> fetch the total (rather than RAM-only) portion of dirty bytes, so as to
> include GPU device states too.
>
> Update the comment of the field to reflect its new meaning.
>
> After this change, the expected-downtime read from query-migrate should be
> very accurate even with VFIO devices involved.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration-stats.h | 8 +++-----
> migration/migration.c | 11 ++++++++---
> migration/ram.c | 1 -
> 3 files changed, 11 insertions(+), 9 deletions(-)
>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info
2026-04-08 16:55 ` [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
2026-04-09 17:17 ` Juraj Marcin
@ 2026-04-09 22:17 ` Fabiano Rosas
2026-04-16 18:19 ` Peter Xu
1 sibling, 1 reply; 52+ messages in thread
From: Fabiano Rosas @ 2026-04-09 22:17 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Joao Martins,
Markus Armbruster, Alex Williamson
Peter Xu <peterx@redhat.com> writes:
> QEMU provides an expected downtime for the whole system during migration,
> by dividing the total dirty RAM recorded at the last sync by the estimated
> switchover bandwidth.
>
> That was flawed once VFIO is taken into account: consider a VFIO GPU device
> that contains GBs of data to migrate during the stop phase. Those will not
> be accounted for in this math.
>
> Fix it by updating dirty_bytes_last_sync properly only when we go to the
> next iteration, rather than hiding this update in the RAM code. Meanwhile,
> fetch the total (rather than RAM-only) portion of dirty bytes, so as to
> include GPU device states too.
>
> Update the comment of the field to reflect its new meaning.
>
> After this change, the expected-downtime read from query-migrate should be
> very accurate even with VFIO devices involved.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration-stats.h | 8 +++-----
> migration/migration.c | 11 ++++++++---
> migration/ram.c | 1 -
> 3 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 326ddb0088..1447316802 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -31,11 +31,9 @@
> */
> typedef struct {
> /*
> - * Number of bytes that were dirty last time that we synced with
> - * the guest memory. We use that to calculate the downtime. As
> - * the remaining dirty amounts to what we know that is still dirty
> - * since last iteration, not counting what the guest has dirtied
> - * since we synchronized bitmaps.
> + * Number of bytes that were reported dirty after the lastest
typo: latest
> + * system-wise synchronization on dirty information. It is used to
s/on/of/ ?
I can fix these at merge time.
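As an aside for readers of this thread: the downtime estimate discussed in the commit message above is essentially total dirty bytes at the last sync divided by the switchover bandwidth. A minimal back-of-the-envelope sketch (the function name and units are illustrative, not QEMU code):

```python
def expected_downtime_ms(dirty_bytes_last_sync, switchover_bw_bytes_per_sec):
    """Estimate switchover downtime in milliseconds.

    dirty_bytes_last_sync should be the system-wise total (RAM plus
    device state such as VFIO) -- including the device portion is
    exactly what this patch fixes.
    """
    return dirty_bytes_last_sync / switchover_bw_bytes_per_sec * 1000

# e.g. 2 GiB of RAM+VFIO state over a ~1.25 GB/s (10 Gbps) link
print(expected_downtime_ms(2 * 2**30, 1.25e9))
```

With GBs of VFIO state left out of the numerator, the same formula would report a much smaller downtime than the guest actually experiences.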
* Re: [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info
2026-04-09 22:17 ` Fabiano Rosas
@ 2026-04-16 18:19 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-16 18:19 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Joao Martins,
Markus Armbruster, Alex Williamson
On Thu, Apr 09, 2026 at 07:17:44PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > QEMU will provide an expected downtime for the whole system during
> > migration, by remembering the total dirty RAM that we synced the last time,
> > divides the estimated switchover bandwidth.
> >
> > That was flawed when VFIO is taking into account: consider there is a VFIO
> > GPU device that contains GBs of data to migrate during stop phase. Those
> > will not be accounted in this math.
> >
> > Fix it by updating dirty_bytes_last_sync properly only when we go to the
> > next iteration, rather than hide this update in the RAM code. Meanwhile,
> > fetch the total (rather than RAM-only) portion of dirty bytes, so as to
> > include GPU device states too.
> >
> > Update the comment of the field to reflect its new meaning.
> >
> > Now after this change, the expected-downtime to be read from query-migrate
> > should be very accurate even with VFIO devices involved.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/migration-stats.h | 8 +++-----
> > migration/migration.c | 11 ++++++++---
> > migration/ram.c | 1 -
> > 3 files changed, 11 insertions(+), 9 deletions(-)
> >
> > diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> > index 326ddb0088..1447316802 100644
> > --- a/migration/migration-stats.h
> > +++ b/migration/migration-stats.h
> > @@ -31,11 +31,9 @@
> > */
> > typedef struct {
> > /*
> > - * Number of bytes that were dirty last time that we synced with
> > - * the guest memory. We use that to calculate the downtime. As
> > - * the remaining dirty amounts to what we know that is still dirty
> > - * since last iteration, not counting what the guest has dirtied
> > - * since we synchronized bitmaps.
> > + * Number of bytes that were reported dirty after the lastest
>
> typo: latest
>
> > + * system-wise synchronization on dirty information. It is used to
>
> s/on/of/ ?
>
> I can fix these at merge time.
Thanks, I'll repost and fix them.
--
Peter Xu
* [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (11 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 12/14] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:41 ` Juraj Marcin
` (2 more replies)
2026-04-08 16:55 ` [PATCH 14/14] migration/qapi: Update unit for avail-switchover-bandwidth Peter Xu
` (2 subsequent siblings)
15 siblings, 3 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson,
Dr. David Alan Gilbert
Currently, mgmt can only query for remaining RAM, not system-wise remaining
data. It was not a problem before, because for a very long time RAM was
the only part that matters.
After VFIO migrations landed upstream, it may not be true anymore
especially considering that there can be GPU devices that contain GBs of
device states.
Add a new "remaining" field in query-migrate results, reflecting
system-wise remaining data, which will include everything (e.g. VFIO).
This information will be useful for mgmt to implement a generic way of stall
detection that covers all system resources. Say, when the system-wise
remaining data does not decrease anymore for a relatively long period of
time, it may mean that the migration has trouble converging, so mgmt can
act based on how this value changes over time (especially if sampled after
each migration iteration).
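To illustrate the stall-detection idea, a mgmt tool could sample "remaining" once per iteration and flag a stall when it stops decreasing over some window. A hypothetical sketch (the window size and policy are up to the caller; nothing here is QEMU or libvirt API):

```python
def is_stalled(remaining_samples, window=5):
    """Return True if the system-wise 'remaining' byte count did not
    decrease across the last `window` samples (one per iteration)."""
    if len(remaining_samples) < window:
        return False
    recent = remaining_samples[-window:]
    # A converging migration keeps dropping below the start of the window.
    return min(recent) >= recent[0]
```

A real tool would also want hysteresis (dirty-sync cycles can make the value bounce up briefly), but the shape of the check is the same.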
Before this patch, "expected_downtime" almost played this role. For
example, monitoring "expected_downtime" at the beginning of each iteration
can in most cases also reflect the progress of migration system-wise. That
said, "expected_downtime" was always calculated based on a bandwidth value
that can fluctuate a lot if avail-switchover-bandwidth is not used. This
new "remaining" field will remove that part of uncertainty for mgmt.
With the new field, HMP "info migrate" now reports this:
(qemu) info migrate
Status: active
Time (ms): total=12080, setup=14, exp_down=300
Remaining (bytes): 1.36 GiB <------------------- newline
RAM info:
Throughput (Mbps): 840.50
Sizes: pagesize=4 KiB, total=4.02 GiB
Transfers: transferred=1.18 GiB, remain=1.36 GiB
Channels: precopy=1.18 GiB, multifd=0 B, postcopy=0 B
Page Types: normal=307923, zero=388148
Page Rates (pps): transfer=25660
Others: dirty_syncs=1
It should be the same value as RAM's remaining report when VFIO is not
involved, and it should report more than that when VFIO is involved.
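Given that relationship, mgmt could even derive the non-RAM (e.g. VFIO) portion of the pending data from a query-migrate reply. A sketch assuming the JSON field names in this series (top-level "remaining" from this patch; the pre-existing RAM-only "remaining" under "ram"):

```python
def non_ram_pending(info):
    """info is a decoded query-migrate result (a dict).

    Returns bytes pending beyond RAM, e.g. VFIO device state.
    """
    total = info.get("remaining", 0)               # system-wise (this patch)
    ram = info.get("ram", {}).get("remaining", 0)  # RAM-only (pre-existing)
    return max(total - ram, 0)
```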
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
qapi/migration.json | 4 ++++
migration/migration-hmp-cmds.c | 5 +++++
migration/migration.c | 11 +++++++++++
3 files changed, 20 insertions(+)
diff --git a/qapi/migration.json b/qapi/migration.json
index e3ad3f0604..a6e24b5685 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -300,6 +300,9 @@
# average memory load of the virtual CPU indirectly. Note that
# zero means guest doesn't dirty memory. (Since 8.1)
#
+# @remaining: amount of bytes remaining to be migrated system-wise,
+# includes both RAM and all devices (like VFIO). (Since 11.1)
+#
# Features:
#
# @unstable: Members @postcopy-latency, @postcopy-vcpu-latency,
@@ -310,6 +313,7 @@
##
{ 'struct': 'MigrationInfo',
'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
+ '*remaining': 'uint64',
'*vfio': 'VfioStats',
'*xbzrle-cache': 'XBZRLECacheStats',
'*total-time': 'int',
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 0a193b8f54..721c211086 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -178,6 +178,11 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
}
}
+ if (info->has_remaining) {
+ g_autofree char *remaining = size_to_str(info->remaining);
+ monitor_printf(mon, "Remaining (bytes): \t%s\n", remaining);
+ }
+
if (info->has_socket_address) {
SocketAddressList *addr;
diff --git a/migration/migration.c b/migration/migration.c
index 4010e5dcf5..c2aa145106 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1076,6 +1076,16 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
}
}
+static void populate_global_info(MigrationInfo *info, MigrationState *s)
+{
+ MigPendingData data = { };
+
+ qemu_savevm_query_pending(&data, false);
+
+ info->has_remaining = true;
+ info->remaining = data.total_bytes;
+}
+
static void populate_ram_info(MigrationInfo *info, MigrationState *s)
{
size_t page_size = qemu_target_page_size();
@@ -1177,6 +1187,7 @@ static void fill_source_migration_info(MigrationInfo *info)
/* TODO add some postcopy stats */
populate_time_info(info, s);
populate_ram_info(info, s);
+ populate_global_info(info, s);
migration_populate_vfio_info(info);
break;
case MIGRATION_STATUS_COLO:
--
2.53.0
* Re: [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports
2026-04-08 16:55 ` [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports Peter Xu
@ 2026-04-09 17:41 ` Juraj Marcin
2026-04-09 21:48 ` Dr. David Alan Gilbert
2026-04-09 22:21 ` Fabiano Rosas
2 siblings, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:41 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson, Dr. David Alan Gilbert
On 2026-04-08 12:55, Peter Xu wrote:
> Currently, mgmt can only query for remaining RAM, not system-wise remaining
> data. It was not a problem before, because for a very long time RAM was
> the only part that matters.
>
> After VFIO migrations landed upstream, it may not be true anymore
> especially considering that there can be GPU devices that contain GBs of
> device states.
>
> Add a new "remaining" field in query-migrate results, reflecting
> system-wise remaining data, which will include everything (e.g. VFIO).
>
> This information will be useful for mgmt to implement generic way of stall
> detection that covers all system resources. Say, when system remaining
> data does not decrease anymore for a relatively long period of time, then
> it may mean that there is a challenge of converging, so mgmt can act based
> on how this value changes over time (especially if sampled after each
> migration iteration).
>
> Before this patch, "expected_downtime" almost played this role. For
> example, by monitoring "expected_downtime" at the beginning of each
> iteration can in most cases also reflect the progress of migration
> system-wise. Said that, "expected_downtime" was always calculated based on
> a bandwidth value that can fluctuate a lot if avail-switchover-bandwidth is
> not used. This new "remaining" field will remove that part of uncertainty
> for mgmt.
>
> With the new field, HMP "info migrate" now reports this:
>
> (qemu) info migrate
> Status: active
> Time (ms): total=12080, setup=14, exp_down=300
> Remaining (bytes): 1.36 GiB <------------------- newline
> RAM info:
> Throughput (Mbps): 840.50
> Sizes: pagesize=4 KiB, total=4.02 GiB
> Transfers: transferred=1.18 GiB, remain=1.36 GiB
> Channels: precopy=1.18 GiB, multifd=0 B, postcopy=0 B
> Page Types: normal=307923, zero=388148
> Page Rates (pps): transfer=25660
> Others: dirty_syncs=1
>
> It should be the same value as RAM's remaining report when VFIO is not
> involved, and it should report more than that when VFIO is involved.
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> qapi/migration.json | 4 ++++
> migration/migration-hmp-cmds.c | 5 +++++
> migration/migration.c | 11 +++++++++++
> 3 files changed, 20 insertions(+)
>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
* Re: [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports
2026-04-08 16:55 ` [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports Peter Xu
2026-04-09 17:41 ` Juraj Marcin
@ 2026-04-09 21:48 ` Dr. David Alan Gilbert
2026-04-16 18:25 ` Peter Xu
2026-04-09 22:21 ` Fabiano Rosas
2 siblings, 1 reply; 52+ messages in thread
From: Dr. David Alan Gilbert @ 2026-04-09 21:48 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
* Peter Xu (peterx@redhat.com) wrote:
> Currently, mgmt can only query for remaining RAM, not system-wise remaining
> data. It was not a problem before, because for a very long time RAM was
> the only part that matters.
>
> After VFIO migrations landed upstream, it may not be true anymore
> especially considering that there can be GPU devices that contain GBs of
> device states.
>
> Add a new "remaining" field in query-migrate results, reflecting
> system-wise remaining data, which will include everything (e.g. VFIO).
Of course you realise the next thing people will ask for is being able
to ask *which* vfio device is the one that's busy.
Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
> This information will be useful for mgmt to implement generic way of stall
> detection that covers all system resources. Say, when system remaining
> data does not decrease anymore for a relatively long period of time, then
> it may mean that there is a challenge of converging, so mgmt can act based
> on how this value changes over time (especially if sampled after each
> migration iteration).
>
> Before this patch, "expected_downtime" almost played this role. For
> example, by monitoring "expected_downtime" at the beginning of each
> iteration can in most cases also reflect the progress of migration
> system-wise. Said that, "expected_downtime" was always calculated based on
> a bandwidth value that can fluctuate a lot if avail-switchover-bandwidth is
> not used. This new "remaining" field will remove that part of uncertainty
> for mgmt.
>
> With the new field, HMP "info migrate" now reports this:
>
> (qemu) info migrate
> Status: active
> Time (ms): total=12080, setup=14, exp_down=300
> Remaining (bytes): 1.36 GiB <------------------- newline
> RAM info:
> Throughput (Mbps): 840.50
> Sizes: pagesize=4 KiB, total=4.02 GiB
> Transfers: transferred=1.18 GiB, remain=1.36 GiB
> Channels: precopy=1.18 GiB, multifd=0 B, postcopy=0 B
> Page Types: normal=307923, zero=388148
> Page Rates (pps): transfer=25660
> Others: dirty_syncs=1
>
> It should be the same value as RAM's remaining report when VFIO is not
> involved, and it should report more than that when VFIO is involved.
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> qapi/migration.json | 4 ++++
> migration/migration-hmp-cmds.c | 5 +++++
> migration/migration.c | 11 +++++++++++
> 3 files changed, 20 insertions(+)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index e3ad3f0604..a6e24b5685 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -300,6 +300,9 @@
> # average memory load of the virtual CPU indirectly. Note that
> # zero means guest doesn't dirty memory. (Since 8.1)
> #
> +# @remaining: amount of bytes remaining to be migrated system-wise,
> +# includes both RAM and all devices (like VFIO). (Since 11.1)
> +#
> # Features:
> #
> # @unstable: Members @postcopy-latency, @postcopy-vcpu-latency,
> @@ -310,6 +313,7 @@
> ##
> { 'struct': 'MigrationInfo',
> 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
> + '*remaining': 'uint64',
> '*vfio': 'VfioStats',
> '*xbzrle-cache': 'XBZRLECacheStats',
> '*total-time': 'int',
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 0a193b8f54..721c211086 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -178,6 +178,11 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
> }
> }
>
> + if (info->has_remaining) {
> + g_autofree char *remaining = size_to_str(info->remaining);
> + monitor_printf(mon, "Remaining (bytes): \t%s\n", remaining);
> + }
> +
> if (info->has_socket_address) {
> SocketAddressList *addr;
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 4010e5dcf5..c2aa145106 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1076,6 +1076,16 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
> }
> }
>
> +static void populate_global_info(MigrationInfo *info, MigrationState *s)
> +{
> + MigPendingData data = { };
> +
> + qemu_savevm_query_pending(&data, false);
> +
> + info->has_remaining = true;
> + info->remaining = data.total_bytes;
> +}
> +
> static void populate_ram_info(MigrationInfo *info, MigrationState *s)
> {
> size_t page_size = qemu_target_page_size();
> @@ -1177,6 +1187,7 @@ static void fill_source_migration_info(MigrationInfo *info)
> /* TODO add some postcopy stats */
> populate_time_info(info, s);
> populate_ram_info(info, s);
> + populate_global_info(info, s);
> migration_populate_vfio_info(info);
> break;
> case MIGRATION_STATUS_COLO:
> --
> 2.53.0
>
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
* Re: [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports
2026-04-09 21:48 ` Dr. David Alan Gilbert
@ 2026-04-16 18:25 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-16 18:25 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
On Thu, Apr 09, 2026 at 09:48:57PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Currently, mgmt can only query for remaining RAM, not system-wise remaining
> > data. It was not a problem before, because for a very long time RAM was
> > the only part that matters.
> >
> > After VFIO migrations landed upstream, it may not be true anymore
> > especially considering that there can be GPU devices that contain GBs of
> > device states.
> >
> > Add a new "remaining" field in query-migrate results, reflecting
> > system-wise remaining data, which will include everything (e.g. VFIO).
>
> Of course you realise the next thing people will ask for is being able
> to ask *which* vfio device is the one that's busy.
Heh, true..
Since so far this effort is driven by a requirement to detect migration
stalls with VFIO devices attached, I'll try to keep this API as simple as
possible, until someone else complains with another good reason..
>
> Reviewed-by: Dr. David Alan Gilbert <dave@treblig.org>
Thanks, Dave.
--
Peter Xu
* Re: [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports
2026-04-08 16:55 ` [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports Peter Xu
2026-04-09 17:41 ` Juraj Marcin
2026-04-09 21:48 ` Dr. David Alan Gilbert
@ 2026-04-09 22:21 ` Fabiano Rosas
2026-04-16 18:26 ` Peter Xu
2 siblings, 1 reply; 52+ messages in thread
From: Fabiano Rosas @ 2026-04-09 22:21 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Joao Martins,
Markus Armbruster, Alex Williamson, Dr. David Alan Gilbert
Peter Xu <peterx@redhat.com> writes:
> Currently, mgmt can only query for remaining RAM, not system-wise remaining
> data. It was not a problem before, because for a very long time RAM was
> the only part that matters.
>
> After VFIO migrations landed upstream, it may not be true anymore
> especially considering that there can be GPU devices that contain GBs of
> device states.
>
> Add a new "remaining" field in query-migrate results, reflecting
> system-wise remaining data, which will include everything (e.g. VFIO).
>
> This information will be useful for mgmt to implement generic way of stall
> detection that covers all system resources. Say, when system remaining
> data does not decrease anymore for a relatively long period of time, then
> it may mean that there is a challenge of converging, so mgmt can act based
> on how this value changes over time (especially if sampled after each
> migration iteration).
>
> Before this patch, "expected_downtime" almost played this role. For
> example, by monitoring "expected_downtime" at the beginning of each
> iteration can in most cases also reflect the progress of migration
> system-wise. Said that, "expected_downtime" was always calculated based on
> a bandwidth value that can fluctuate a lot if avail-switchover-bandwidth is
> not used. This new "remaining" field will remove that part of uncertainty
> for mgmt.
>
> With the new field, HMP "info migrate" now reports this:
>
> (qemu) info migrate
> Status: active
> Time (ms): total=12080, setup=14, exp_down=300
> Remaining (bytes): 1.36 GiB <------------------- newline
Either bytes or GiB. Better to simply remove the "(bytes)" string.
> RAM info:
> Throughput (Mbps): 840.50
> Sizes: pagesize=4 KiB, total=4.02 GiB
> Transfers: transferred=1.18 GiB, remain=1.36 GiB
> Channels: precopy=1.18 GiB, multifd=0 B, postcopy=0 B
> Page Types: normal=307923, zero=388148
> Page Rates (pps): transfer=25660
> Others: dirty_syncs=1
>
> It should be the same value as RAM's remaining report when VFIO is not
> involved, and it should report more than that when VFIO is involved.
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> qapi/migration.json | 4 ++++
> migration/migration-hmp-cmds.c | 5 +++++
> migration/migration.c | 11 +++++++++++
> 3 files changed, 20 insertions(+)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index e3ad3f0604..a6e24b5685 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -300,6 +300,9 @@
> # average memory load of the virtual CPU indirectly. Note that
> # zero means guest doesn't dirty memory. (Since 8.1)
> #
> +# @remaining: amount of bytes remaining to be migrated system-wise,
> +# includes both RAM and all devices (like VFIO). (Since 11.1)
> +#
> # Features:
> #
> # @unstable: Members @postcopy-latency, @postcopy-vcpu-latency,
> @@ -310,6 +313,7 @@
> ##
> { 'struct': 'MigrationInfo',
> 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
> + '*remaining': 'uint64',
> '*vfio': 'VfioStats',
> '*xbzrle-cache': 'XBZRLECacheStats',
> '*total-time': 'int',
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 0a193b8f54..721c211086 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -178,6 +178,11 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
> }
> }
>
> + if (info->has_remaining) {
> + g_autofree char *remaining = size_to_str(info->remaining);
> + monitor_printf(mon, "Remaining (bytes): \t%s\n", remaining);
> + }
> +
> if (info->has_socket_address) {
> SocketAddressList *addr;
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 4010e5dcf5..c2aa145106 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1076,6 +1076,16 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
> }
> }
>
> +static void populate_global_info(MigrationInfo *info, MigrationState *s)
> +{
> + MigPendingData data = { };
> +
> + qemu_savevm_query_pending(&data, false);
> +
> + info->has_remaining = true;
> + info->remaining = data.total_bytes;
> +}
> +
> static void populate_ram_info(MigrationInfo *info, MigrationState *s)
> {
> size_t page_size = qemu_target_page_size();
> @@ -1177,6 +1187,7 @@ static void fill_source_migration_info(MigrationInfo *info)
> /* TODO add some postcopy stats */
> populate_time_info(info, s);
> populate_ram_info(info, s);
> + populate_global_info(info, s);
> migration_populate_vfio_info(info);
> break;
> case MIGRATION_STATUS_COLO:
* Re: [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports
2026-04-09 22:21 ` Fabiano Rosas
@ 2026-04-16 18:26 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-16 18:26 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Joao Martins,
Markus Armbruster, Alex Williamson, Dr. David Alan Gilbert
On Thu, Apr 09, 2026 at 07:21:22PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Currently, mgmt can only query for remaining RAM, not system-wise remaining
> > data. It was not a problem before, because for a very long time RAM was
> > the only part that matters.
> >
> > After VFIO migrations landed upstream, it may not be true anymore
> > especially considering that there can be GPU devices that contain GBs of
> > device states.
> >
> > Add a new "remaining" field in query-migrate results, reflecting
> > system-wise remaining data, which will include everything (e.g. VFIO).
> >
> > This information will be useful for mgmt to implement generic way of stall
> > detection that covers all system resources. Say, when system remaining
> > data does not decrease anymore for a relatively long period of time, then
> > it may mean that there is a challenge of converging, so mgmt can act based
> > on how this value changes over time (especially if sampled after each
> > migration iteration).
> >
> > Before this patch, "expected_downtime" almost played this role. For
> > example, by monitoring "expected_downtime" at the beginning of each
> > iteration can in most cases also reflect the progress of migration
> > system-wise. Said that, "expected_downtime" was always calculated based on
> > a bandwidth value that can fluctuate a lot if avail-switchover-bandwidth is
> > not used. This new "remaining" field will remove that part of uncertainty
> > for mgmt.
> >
> > With the new field, HMP "info migrate" now reports this:
> >
> > (qemu) info migrate
> > Status: active
> > Time (ms): total=12080, setup=14, exp_down=300
> > Remaining (bytes): 1.36 GiB <------------------- newline
>
> Either bytes or GiB. Better to simply remove the "(bytes)" string.
Yeah, I was referring to the line above, but then I found we have a unit
for this line, so "(bytes)" is not needed. I'll drop it.
--
Peter Xu
* [PATCH 14/14] migration/qapi: Update unit for avail-switchover-bandwidth
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (12 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 13/14] migration/qapi: Introduce system-wise "remaining" reports Peter Xu
@ 2026-04-08 16:55 ` Peter Xu
2026-04-09 17:40 ` Juraj Marcin
2026-04-08 18:37 ` [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-04-13 16:09 ` Cédric Le Goater
15 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2026-04-08 16:55 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Peter Xu, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Cédric Le Goater, Fabiano Rosas,
Joao Martins, Markus Armbruster, Alex Williamson
Add ", in bytes per second". Unfortunately the indentation needs to be
updated throughout, but there is no change to the rest.
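Since the parameter is in bytes per second while link speeds are usually quoted in bits per second, a small conversion helper may avoid unit mistakes when scripting QMP (illustrative only; QEMU accepts a plain integer, and the HMP command also takes size suffixes like "1G" as shown later in this thread):

```python
def gbps_to_bytes_per_sec(gbps):
    """Convert a link speed in gigabits/s to the bytes/s value that
    avail-switchover-bandwidth expects."""
    return int(gbps * 1e9 / 8)

# A 10 Gbps NIC gives 1.25e9 bytes/s as a reasonable upper bound.
print(gbps_to_bytes_per_sec(10))
```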
Cc: Markus Armbruster <armbru@redhat.com>
Suggested-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
qapi/migration.json | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/qapi/migration.json b/qapi/migration.json
index a6e24b5685..b7518b29c6 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -921,15 +921,15 @@
# (Since 2.8)
#
# @avail-switchover-bandwidth: to set the available bandwidth that
-# migration can use during switchover phase. **Note:** this does
-# not limit the bandwidth during switchover, but only for
-# calculations when making decisions to switchover. By default,
-# this value is zero, which means QEMU will estimate the bandwidth
-# automatically. This can be set when the estimated value is not
-# accurate, while the user is able to guarantee such bandwidth is
-# available when switching over. When specified correctly, this
-# can make the switchover decision much more accurate.
-# (Since 8.2)
+# migration can use during switchover phase, in bytes per
+# second. **Note:** this does not limit the bandwidth during
+# switchover, but only for calculations when making decisions to
+# switchover. By default, this value is zero, which means QEMU
+# will estimate the bandwidth automatically. This can be set
+# when the estimated value is not accurate, while the user is
+# able to guarantee such bandwidth is available when switching
+# over. When specified correctly, this can make the switchover
+# decision much more accurate. (Since 8.2)
#
# @downtime-limit: set maximum tolerated downtime for migration.
# maximum downtime in milliseconds (Since 2.8)
--
2.53.0
* Re: [PATCH 14/14] migration/qapi: Update unit for avail-switchover-bandwidth
2026-04-08 16:55 ` [PATCH 14/14] migration/qapi: Update unit for avail-switchover-bandwidth Peter Xu
@ 2026-04-09 17:40 ` Juraj Marcin
0 siblings, 0 replies; 52+ messages in thread
From: Juraj Marcin @ 2026-04-09 17:40 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On 2026-04-08 12:55, Peter Xu wrote:
> Add ", in bytes per second". Unfortunately indentations need to be updated
> completely, but no change on the rest.
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Suggested-by: Juraj Marcin <jmarcin@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> qapi/migration.json | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index a6e24b5685..b7518b29c6 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -921,15 +921,15 @@
> # (Since 2.8)
> #
> # @avail-switchover-bandwidth: to set the available bandwidth that
> -# migration can use during switchover phase. **Note:** this does
> -# not limit the bandwidth during switchover, but only for
> -# calculations when making decisions to switchover. By default,
> -# this value is zero, which means QEMU will estimate the bandwidth
> -# automatically. This can be set when the estimated value is not
> -# accurate, while the user is able to guarantee such bandwidth is
> -# available when switching over. When specified correctly, this
> -# can make the switchover decision much more accurate.
> -# (Since 8.2)
> +# migration can use during switchover phase, in bytes per
> +# second. **Note:** this does not limit the bandwidth during
> +# switchover, but only for calculations when making decisions to
> +# switchover. By default, this value is zero, which means QEMU
> +# will estimate the bandwidth automatically. This can be set
> +# when the estimated value is not accurate, while the user is
> +# able to guarantee such bandwidth is available when switching
> +# over. When specified correctly, this can make the switchover
> +# decision much more accurate. (Since 8.2)
> #
> # @downtime-limit: set maximum tolerated downtime for migration.
> # maximum downtime in milliseconds (Since 2.8)
> --
> 2.53.0
>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (13 preceding siblings ...)
2026-04-08 16:55 ` [PATCH 14/14] migration/qapi: Update unit for avail-switchover-bandwidth Peter Xu
@ 2026-04-08 18:37 ` Peter Xu
2026-04-13 16:09 ` Cédric Le Goater
15 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-08 18:37 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Cédric Le Goater, Fabiano Rosas, Joao Martins,
Markus Armbruster, Alex Williamson
On Wed, Apr 08, 2026 at 12:55:44PM -0400, Peter Xu wrote:
> Tests
> =====
Re-inserting all the commands I used for testing below; they were dropped
when I posted the cover letter because they were formatted as comments.
>
> Tested this series with an assigned VFIO device GRID RTX6000-2B, FB memory
> 2GB.
>
> The test covers both the correct reporting of system-wise remaining data
> (which used to only cover RAM) and the expected downtime. I verified that,
> using the expected downtime value reported, I can converge a VFIO migration
> immediately. The test process is as below:
>
> Start the VM and kick off migration until it spins at the end, not
> converging with the default 300ms downtime. This is common for a 2GB vGPU
> device due to both the huge stopcopy size and the dramatically small mbps
> reported.
>
> As a start, update avail-switchover-bandwidth (I chose 1GB/s for a real 10Gbps port):
# virsh qemu-monitor-command $vm --hmp "migrate_set_parameter avail-switchover-bandwidth 1G"
> This will stabilize the bandwidth used for switchover calculations.
>
> Libvirt's domjobinfo won't be able to see the real remaining data because
> libvirt doesn't support the new "remaining" field yet; however, we can
> still see that expected_downtime is reported correctly now (instead of
> reporting zero, as before this patch was applied):
# virsh domjobinfo $vm | grep -E "Expected|remaining"
> Data remaining: 0.000 B
> Memory remaining: 0.000 B
> Expected downtime: 1910 ms
>
> If we peek through the QEMU monitor, we'll see that with the change the
> system-wise remaining data is 1.91 GiB (even though RAM keeps reporting 0),
> and the expected downtime matches what domjobinfo reports, 1.91 seconds:
# virsh qemu-monitor-command $vm --hmp "info migrate"
> Status: active
> Time (ms): total=336919, setup=10, exp_down=1910
> Remaining (bytes): 1.91 GiB
> RAM info:
> Throughput (Mbps): 460.09
> Sizes: pagesize=4 KiB, total=32 GiB
> Transfers: transferred=12.7 GiB, remain=0 B
> Channels: precopy=12.7 GiB, multifd=0 B, postcopy=0 B, vfio=0 B
> Page Types: normal=3306906, zero=7745576
> Page Rates (pps): transfer=14010, dirty=8039
> Others: dirty_syncs=247045
>
> It means 1.91 seconds is the lowest required downtime, per the math.
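The math behind that number is a simple ratio of remaining data to switchover bandwidth. A minimal sketch (illustrative names only, not QEMU's actual implementation):

```python
# Minimal sketch of the expected-downtime math discussed above:
# blackout window ~= remaining bytes / switchover bandwidth.
# Illustrative only -- not QEMU's actual implementation.

GIB = 1024 ** 3

def expected_downtime_ms(remaining_bytes: int, bw_bytes_per_sec: int) -> float:
    """Estimated blackout window (ms) if we switched over right now."""
    assert bw_bytes_per_sec > 0
    return remaining_bytes * 1000 / bw_bytes_per_sec

# 1.91 GiB remaining with avail-switchover-bandwidth set to 1 GiB/s:
print(round(expected_downtime_ms(int(1.91 * GIB), GIB)))  # -> 1910
```

This lines up with the exp_down=1910 value shown in the monitor output.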
>
> If we try to set something lower than that, migration will not converge:
# virsh qemu-monitor-command $vm --hmp "migrate_set_parameter downtime-limit 1000"
...
# virsh qemu-monitor-command $vm --hmp "migrate_set_parameter downtime-limit 1500"
...
> Then if we update downtime_limit to be slightly larger than expected downtime:
# virsh qemu-monitor-command $vm --hmp "migrate_set_parameter downtime-limit 2000"
> Migration will complete almost immediately.
--
Peter Xu
* Re: [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports
2026-04-08 16:55 [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (14 preceding siblings ...)
2026-04-08 18:37 ` [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
@ 2026-04-13 16:09 ` Cédric Le Goater
2026-04-15 16:06 ` Peter Xu
15 siblings, 1 reply; 52+ messages in thread
From: Cédric Le Goater @ 2026-04-13 16:09 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, Daniel P . Berrangé, Zhiyi Guo,
Juraj Marcin, Prasad Pandit, Avihai Horon, Kirti Wankhede,
Fabiano Rosas, Joao Martins, Markus Armbruster, Alex Williamson
On 4/8/26 18:55, Peter Xu wrote:
> CI: https://gitlab.com/peterx/qemu/-/pipelines/2437886506
> rfc: https://lore.kernel.org/r/20260319231302.123135-1-peterx@redhat.com
>
> This is v1 of this series. I dropped the RFC tag because I feel I collected
> enough feedback on the previous version about what was uncertain; meanwhile
> I also managed to borrow a system with an NVIDIA RTX6000 2GB vGPU and
> tested it.
>
> Too many trivial things have changed from RFC to v1 to list here; let me
> only mention the major changes:
>
> - This version assumes both VFIO ioctls (reporting either precopy or
>   stopcopy size) may report anything (say, garbage), and that shouldn't
>   crash QEMU. It will affect what gets reported as downtime or remaining
>   data, but that's best effort, so it's expected. With that in mind, I
>   dropped patch 3 as Avihai suggested. IOW, I expect no concern about
>   either overflow/underflow or atomicity when reading these values from
>   the VFIO drivers.
>
> - The cached stopcopy_bytes for VFIO now always reflects the total size
>   (including precopy sizes).
>
> - Introduced one new patch to report "system-wise" remaining data, which
>   will start to include the VFIO remaining device data. We can't squash
>   that directly into the "ram" section of the query-migrate QMP results,
>   so I introduced a new "remaining" field in the query-migrate result for it.
>
> - One more patch, "migration: Make qemu_savevm_query_pending() available
>   anytime", tries to fix a very hard-to-hit race condition I found when
>   testing against the virtio-net-failover tests. I can only hit it if I
>   run tens of concurrent tests, but it is needed to fix a crash.
>
> Otherwise the major things are kept almost as-is. I should also have
> addressed all the comments I received on the RFC version. Please shoot if
> I missed something.
>
> Overview
> ========
>
> VFIO migration was merged quite a while ago, but we do still see things
> off here and there. This series tries to address some of them, based only
> on my limited understanding.
>
> Two major issues I wanted to resolve:
>
> (1) VFIO reports state_pending_{exact|estimate}() differently
>
> It reports the stopcopy size (which includes both precopy and stopcopy
> data) in exact() only, while in estimate() it only reports precopy data.
> This violates the API. It was done like that only to trigger a proper
> sync on the VFIO ioctls, but that was just a workaround. This series
> should fix it by introducing a stopcopy size reporting facility for
> vmstate handlers.
>
> (2) expected_downtime / remaining doesn't take VFIO devices into account
>
> When querying migration status, QEMU reports a field called
> "expected-downtime". The documentation phrased this almost entirely from
> a RAM perspective, but ideally it should be an estimate of the blackout
> window (in milliseconds) if we were to switch over at any time, based on
> known information.
>
> This didn't yet take VFIO into account, especially in the case of VFIO
> devices that may contain a large amount of device state (like GPUs).
>
> For problem (2), the use case is that a mgmt app migrating a VFIO GPU
> device always needs to adjust the downtime for the migration to converge,
> because when such a device is involved a normal downtime like 300ms will
> usually not suffice.
>
> Now the issue is that the mgmt app doesn't have a good way to know exactly
> how well the precopy goes for the whole system, including the GPU device.
>
> The hope is that the fixed expected_downtime will give the mgmt app a
> reasonable hint for the downtime to set to converge a migration.
>
> Meanwhile, with a system-wise "remaining" field introduced, mgmt can query
> these results at the beginning of each iteration to know whether a stall
> is happening, IOW, whether it's likely that this migration will not
> converge at all. When that is detected, mgmt can start to consider the
> expected_downtime value reported above for converging this migration.
> See more on testing below.
>
> Tests
> =====
>
> Tested this series with an assigned VFIO device GRID RTX6000-2B, FB memory
> 2GB.
>
> The test covers both the correct reporting of system-wise remaining data
> (which used to only cover RAM) and the expected downtime. I verified that,
> using the expected downtime value reported, I can converge a VFIO migration
> immediately. The test process is as below:
>
> Start the VM and kick off migration until it spins at the end, not
> converging with the default 300ms downtime. This is common for a 2GB vGPU
> device due to both the huge stopcopy size and the dramatically small mbps
> reported.
>
> As a start, update avail-switchover-bandwidth (I chose 1GB/s for a real 10Gbps port):
>
> This will stabilize the bandwidth used for switchover calculations.
>
> Libvirt's domjobinfo won't be able to see the real remaining data because
> libvirt doesn't support the new "remaining" field yet; however, we can
> still see that expected_downtime is reported correctly now (instead of
> reporting zero, as before this patch was applied):
>
> Data remaining: 0.000 B
> Memory remaining: 0.000 B
> Expected downtime: 1910 ms
>
> If we peek through the QEMU monitor, we'll see that with the change the
> system-wise remaining data is 1.91 GiB (even though RAM keeps reporting 0),
> and the expected downtime matches what domjobinfo reports, 1.91 seconds:
>
> Status: active
> Time (ms): total=336919, setup=10, exp_down=1910
> Remaining (bytes): 1.91 GiB
> RAM info:
> Throughput (Mbps): 460.09
> Sizes: pagesize=4 KiB, total=32 GiB
> Transfers: transferred=12.7 GiB, remain=0 B
> Channels: precopy=12.7 GiB, multifd=0 B, postcopy=0 B, vfio=0 B
> Page Types: normal=3306906, zero=7745576
> Page Rates (pps): transfer=14010, dirty=8039
> Others: dirty_syncs=247045
>
> It means 1.91 seconds is the lowest required downtime, per the math.
>
> If we try to set something lower than that, migration will not converge:
>
> ...
> ...
>
> Then if we update downtime_limit to be slightly larger than expected downtime:
>
> Migration will complete almost immediately.
I tested the series by performing a migration of a RHEL9 VM with a vGPU
(NVIDIA L4-2B) and an MLX5 VF, from a RHEL9 host (vGPU mdev) to a RHEL10
host (vGPU VF), with the vGPU under load.
The expected downtime was fluctuating around 2600 ms. Bandwidth was
around 950 Mbps. Setting downtime to 3000ms made the migration
converge quickly.
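Those figures can be cross-checked with simple arithmetic (a back-of-the-envelope sketch; the remaining-state size is implied by the reported numbers, not measured in the test):

```python
# Back-of-the-envelope check of the reported figures: at ~950 Mbps of
# switchover bandwidth, a ~2600 ms expected downtime implies roughly
# 300 MB of remaining state. Purely illustrative arithmetic.

bw_bytes_per_sec = 950_000_000 / 8   # 950 Mbps -> bytes/s
expected_downtime_s = 2.6            # 2600 ms

implied_remaining_bytes = bw_bytes_per_sec * expected_downtime_s
print(round(implied_remaining_bytes))  # -> 308750000, i.e. ~295 MiB
```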
Tested-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
>
> Peter Xu (14):
> migration: Fix low possibility downtime violation
> migration/qapi: Rename MigrationStats to MigrationRAMStats
> vfio/migration: Cache stop size in VFIOMigration
> migration/treewide: Merge @state_pending_{exact|estimate} APIs
> migration: Use the new save_query_pending() API directly
> migration: Introduce stopcopy_bytes in save_query_pending()
> vfio/migration: Fix incorrect reporting for VFIO pending data
> migration: Make qemu_savevm_query_pending() available anytime
> migration: Move iteration counter out of RAM
> migration: Introduce a helper to return switchover bw estimate
> migration: Calculate expected downtime on demand
> migration: Fix calculation of expected_downtime to take VFIO info
> migration/qapi: Introduce system-wise "remaining" reports
> migration/qapi: Update unit for avail-switchover-bandwidth
>
> docs/about/removed-features.rst | 2 +-
> docs/devel/migration/main.rst | 9 +-
> docs/devel/migration/vfio.rst | 9 +-
> qapi/migration.json | 32 +++---
> hw/vfio/vfio-migration-internal.h | 8 ++
> include/migration/register.h | 59 ++++------
> migration/migration-stats.h | 13 ++-
> migration/migration.h | 10 +-
> migration/savevm.h | 9 +-
> hw/s390x/s390-stattrib.c | 9 +-
> hw/vfio/migration.c | 92 +++++++++-------
> migration/block-dirty-bitmap.c | 10 +-
> migration/migration-hmp-cmds.c | 5 +
> migration/migration.c | 172 +++++++++++++++++++++---------
> migration/ram.c | 40 ++-----
> migration/savevm.c | 73 +++++++------
> hw/vfio/trace-events | 3 +-
> migration/trace-events | 3 +-
> 18 files changed, 313 insertions(+), 245 deletions(-)
>
* Re: [PATCH 00/14] migration/vfio: Fix a few issues on API misuse or statistic reports
2026-04-13 16:09 ` Cédric Le Goater
@ 2026-04-15 16:06 ` Peter Xu
0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2026-04-15 16:06 UTC (permalink / raw)
To: Cédric Le Goater
Cc: qemu-devel, Maciej S . Szmigiero, Daniel P . Berrangé,
Zhiyi Guo, Juraj Marcin, Prasad Pandit, Avihai Horon,
Kirti Wankhede, Fabiano Rosas, Joao Martins, Markus Armbruster,
Alex Williamson
On Mon, Apr 13, 2026 at 06:09:15PM +0200, Cédric Le Goater wrote:
> I tested the series by performing a migration of a RHEL9 VM with a vGPU
> (NVIDIA L4-2B) and an MLX5 VF, from a RHEL9 host (vGPU mdev) to a RHEL10
> host (vGPU VF), with the vGPU under load.
>
> The expected downtime was fluctuating around 2600 ms. Bandwidth was
> around 950 Mbps. Setting downtime to 3000ms made the migration
> converge quickly.
>
> Tested-by: Cédric Le Goater <clg@redhat.com>
Thanks, Cédric. I'll take the liberty of adding this to patch 13 when I repost.
--
Peter Xu