* [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-20 12:26 ` Prasad Pandit
2026-03-27 15:05 ` Juraj Marcin
2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
` (10 subsequent siblings)
11 siblings, 2 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
When QEMU queries the estimated amount of pending data and thinks it's
ready to converge, it sends one more accurate query to make sure of it.
That is needed so that we collect the latest reports and the equation
still holds true.

However, we missed one tiny difference here between "<" and "<=" when
comparing pending_size (A) to threshold_size (B).

The QEMU source side only re-queries if A < B, but will kick off
switchover if A <= B.

This means that if A (still only an estimate so far) happens to exactly
equal B, the re-query won't happen and switchover will proceed without
considering newly dirtied data.

It turns out this was an accident in my commit 7aaa1fc072 when
refactoring the surrounding code. Fix it by using the same comparison
in both places.
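To illustrate the window (a hypothetical model, not the actual QEMU
code; the function name is mine): with the old comparison, an estimate
that exactly equals the threshold skips the exact re-query yet still
triggers switchover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the check in migration_iteration_run(): returns
 * true when switchover would proceed on a stale estimate, i.e. without
 * the exact re-query having refreshed pending_size first. */
static bool switchover_on_stale_estimate(uint64_t estimate,
                                         uint64_t threshold,
                                         bool fixed)
{
    /* The buggy code re-queried only when estimate < threshold; the
     * fix re-queries when estimate <= threshold, matching the trigger
     * below. */
    bool requeried = fixed ? (estimate <= threshold)
                           : (estimate < threshold);
    /* Switchover kicks off when estimate <= threshold. */
    bool switchover = (estimate <= threshold);
    return switchover && !requeried;
}
```

With estimate == threshold, the unfixed variant reports a switchover on
stale data; the fixed variant does not.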
Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58..dfc60372cf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* postcopy started, so ESTIMATE should always match with EXACT
* during postcopy phase.
*/
- if (pending_size < s->threshold_size) {
+ if (pending_size <= s->threshold_size) {
qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
pending_size = must_precopy + can_postcopy;
trace_migrate_pending_exact(pending_size, must_precopy,
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
@ 2026-03-20 12:26 ` Prasad Pandit
2026-03-27 14:35 ` Juraj Marcin
2026-03-27 15:05 ` Juraj Marcin
1 sibling, 1 reply; 67+ messages in thread
From: Prasad Pandit @ 2026-03-20 12:26 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> When QEMU queried the estimated version of pending data and thinks it's
> ready to converge, it'll send another accurate query to make sure of it.
> It is needed to make sure we collect the latest reports and that equation
> still holds true.
>
> However we missed one tiny little difference here on "<" v.s. "<=" when
> comparing pending_size (A) to threshold_size (B)..
>
> QEMU src only re-query if A<B, but will kickoff switchover if A<=B.
>
> I think it means it is possible to happen if A (as an estimate only so far)
> accidentally equals to B, then re-query won't happen and switchover will
> proceed without considering new dirtied data.
>
> It turns out it was an accident in my commit 7aaa1fc072 when refactoring
> the code around. Fix this by using the same equation in both places.
>
> Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 5c9aaa6e58..dfc60372cf 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy started, so ESTIMATE should always match with EXACT
> * during postcopy phase.
> */
> - if (pending_size < s->threshold_size) {
> + if (pending_size <= s->threshold_size) {
> qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
> pending_size = must_precopy + can_postcopy;
> trace_migrate_pending_exact(pending_size, must_precopy,
* What is the 'size' difference between '< s->threshold_size' vs
'<= s->threshold_size'? Going through the source, IIUC:
1) 'pending_size' is measured in Bytes.
static void ram_state_pending_exact/_estimate()
remaining_size = rs->migration_dirty_pages *
TARGET_PAGE_SIZE(=4096 bytes);
100 dirty pages * 4096bytes => 409600 dirty bytes => 409600
* 8 => 3,276,800 dirty bits
2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
downtime(=300 ms)
100,000,000 bits/s => 100,000 bits/ms
100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
30,000,000 bits / 8 => 3,750,000 Bytes / 300 ms
s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
transferred in 300ms downtime.
* Are we comparing pending_size(=409600 bytes) <=
s->threshold_size(=30,000,000 bits)?
* static void migration_update_counters()
transferred = current_bytes - s->iteration_initial_bytes;
bandwidth = (double)transferred / time_spent
if (switchover_bw) {
expected_bw_per_ms = (double)switchover_bw / 1000;
} else {
expected_bw_per_ms = bandwidth;
}
=> ^^^^^^^ Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?
s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
migration_iteration_run():
/* Should we switch to postcopy now? */
if (must_precopy <= s->threshold_size &&
can_switchover && qatomic_read(&s->start_postcopy)) {
if (postcopy_start(s, &local_err)) {
migrate_error_propagate(s, error_copy(local_err));
error_report_err(local_err);
}
return MIG_ITERATE_SKIP;
}
* Here we should check pending_size <= s->threshold_size, because
must_precopy is zero(0) when postcopy is enabled. And we switch to
postcopy mode even when pending_size > s->threshold_size.
I wonder if we really need both 'must_precopy' and 'can_postcopy'
variables, they seem to complicate things.
===
# virsh migrate --verbose --live --auto-converge --postcopy
--postcopy-after-precopy f42vm
qemu+ssh://destination-machine.com/system
# less /var/log/libvirt/qemu/f42vm.log
...
migration_iteration_run: estimated pending_size: 50577408 bytes,
s->threshold_size: 36282361
migration_iteration_run: estimated pending_size: 43757568 bytes,
s->threshold_size: 36282361
migration_iteration_run: estimated pending_size: 36413440 bytes,
s->threshold_size: 34334680
migration_iteration_run: estimated pending_size: 29069312 bytes,
s->threshold_size: 34334680
migration_iteration_run: exact pending_size: 4339167232 bytes, 0,
4339167232 <== exact size is calculated once.
migration_iteration_run: estimated pending_size: 4332871680 bytes,
s->threshold_size: 35651363
migration_iteration_run: switching to postcopy: 4332871680, 0,
4332871680 <== switch to postcopy with
must_precopy(=0) <= s->threshold_size
migration_iteration_run: estimated pending_size: 4332892160 bytes,
s->threshold_size: 35651363
migration_iteration_run: estimated pending_size: 4323188736 bytes,
s->threshold_size: 27243109
migration_iteration_run: estimated pending_size: 4315320320 bytes,
s->threshold_size: 27243109
migration_iteration_run: estimated pending_size: 4308221952 bytes,
s->threshold_size: 37695433
===
* Here, the exact pending_size is calculated only once, because we
switch to Postcopy mode even when pending_size is > s->threshold_size.
Thank you.
---
- Prasad
* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-20 12:26 ` Prasad Pandit
@ 2026-03-27 14:35 ` Juraj Marcin
2026-03-30 11:52 ` Prasad Pandit
2026-04-01 19:11 ` Peter Xu
0 siblings, 2 replies; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 14:35 UTC (permalink / raw)
To: Prasad Pandit, Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
Hi Prasad,
On 2026-03-20 17:56, Prasad Pandit wrote:
> On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> > When QEMU queried the estimated version of pending data and thinks it's
> > ready to converge, it'll send another accurate query to make sure of it.
> > It is needed to make sure we collect the latest reports and that equation
> > still holds true.
> >
> > However we missed one tiny little difference here on "<" v.s. "<=" when
> > comparing pending_size (A) to threshold_size (B)..
> >
> > QEMU src only re-query if A<B, but will kickoff switchover if A<=B.
> >
> > I think it means it is possible to happen if A (as an estimate only so far)
> > accidentally equals to B, then re-query won't happen and switchover will
> > proceed without considering new dirtied data.
> >
> > It turns out it was an accident in my commit 7aaa1fc072 when refactoring
> > the code around. Fix this by using the same equation in both places.
> >
> > Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
> > Cc: qemu-stable@nongnu.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/migration.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58..dfc60372cf 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> > * postcopy started, so ESTIMATE should always match with EXACT
> > * during postcopy phase.
> > */
> > - if (pending_size < s->threshold_size) {
> > + if (pending_size <= s->threshold_size) {
> > qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
> > pending_size = must_precopy + can_postcopy;
> > trace_migrate_pending_exact(pending_size, must_precopy,
>
> * What is the 'size' difference between < s->threshold_size Vs <=
> s->threshold_size? Going through the source IIUC
> 1) 'pending_size' is measured in Bytes.
> static void ram_state_pending_exact/_estimate()
> remaining_size = rs->migration_dirty_pages *
> TARGET_PAGE_SIZE(=4096 bytes);
> 100 dirty pages * 4096bytes => 409600 dirty bytes => 409600
> * 8 => 3,276,800 dirty bits
>
> 2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
> downtime(=300 ms)
> 100,000,000 bits/s => 100,000 bits/ms
> 100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
> 30,000,000 bits / 8 => 3,750,000 Bytes / 300 ms
> s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
> transferred in 300ms downtime.
>
> * Are we comparing pending_size(=409600 bytes) <=
> s->threshold_size(=30,000,000 bits)?
While threshold_size is indeed derived from bandwidth, bandwidth is in
bytes:
current_bytes = migration_transferred_bytes();
transferred = current_bytes - s->iteration_initial_bytes;
time_spent = current_time - s->iteration_start_time;
bandwidth = (double)transferred / time_spent;
Conversion to bits only happens for the mbps statistic:
s->mbps = (((double) transferred * 8.0) /
((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> * static void migration_update_counters()
> transferred = current_bytes - s->iteration_initial_bytes;
> bandwidth = (double)transferred / time_spent
> if (switchover_bw) {
> expected_bw_per_ms = (double)switchover_bw / 1000;
> } else {
> expected_bw_per_ms = bandwidth;
> }
> => ^^^^^^^ Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?
switchover_bw is expected to be in bytes/sec; however, time_spent is
already in msec, thus bandwidth is also in bytes/msec. The existing
code is correct.
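A worked example of that unit bookkeeping (made-up numbers; the helper
names are mine, not QEMU's): with time kept in milliseconds, bandwidth
naturally comes out in bytes/msec and threshold_size in bytes, while
bits only ever appear in the mbps statistic.

```c
#include <assert.h>
#include <stdint.h>

/* bandwidth in bytes/msec, threshold in bytes -- mirroring the
 * arithmetic in migration_update_counters() quoted above. */
static uint64_t threshold_bytes(uint64_t transferred_bytes,
                                uint64_t time_spent_ms,
                                uint64_t downtime_limit_ms)
{
    double bandwidth = (double)transferred_bytes / time_spent_ms; /* B/ms */
    return (uint64_t)(bandwidth * downtime_limit_ms);             /* bytes */
}

/* Bits enter only here, for the mbps statistic. */
static double mbps(uint64_t transferred_bytes, uint64_t time_spent_ms)
{
    return (((double)transferred_bytes * 8.0) /
            ((double)time_spent_ms / 1000.0)) / 1000.0 / 1000.0;
}
```

With 3,750,000 bytes moved in 300 ms and a 300 ms downtime limit,
threshold_bytes() gives 3,750,000 bytes and mbps() gives ~100.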
@Peter, not sure if it is necessary, but it could be useful to mention
in the MigrationParameters docs that avail-switchover-bandwidth is in
bytes, not bits?
>
> s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
>
> migration_iteration_run():
> /* Should we switch to postcopy now? */
> if (must_precopy <= s->threshold_size &&
> can_switchover && qatomic_read(&s->start_postcopy)) {
> if (postcopy_start(s, &local_err)) {
> migrate_error_propagate(s, error_copy(local_err));
> error_report_err(local_err);
> }
> return MIG_ITERATE_SKIP;
> }
> * Here we should check pending_size <= s->threshold_size, because
> must_precopy is zero(0) when postcopy is enabled. And we switch to
> postcopy mode even when pending_size > s->threshold_size.
> I wonder if we really need both 'must_precopy' and 'can_postcopy'
> variables, they seem to complicate things.
With devices that implement the pending methods, don't support
postcopy, and are not yet migrated, must_precopy would not be zero.

Both must_precopy and can_postcopy are required; that is what allows
postcopy to switch over early. pending_size is the overall total that
also includes postcopiable data, hence it is only used to trigger
precopy completion.

However, the majority of devices don't implement the pending methods
(yet) and thus are not counted towards the estimate, even if they don't
support postcopy and affect the downtime.

I wonder if VMSD devices could implement some pending estimates based
on their defined fields; this would also help avoid violating the
downtime requirements.
> ===
> # virsh migrate --verbose --live --auto-converge --postcopy
> --postcopy-after-precopy f42vm
> qemu+ssh://destination-machine.com/system
> # less /var/log/libvirt/qemu/f42vm.log
> ...
> migration_iteration_run: estimated pending_size: 50577408 bytes,
> s->threshold_size: 36282361
> migration_iteration_run: estimated pending_size: 43757568 bytes,
> s->threshold_size: 36282361
> migration_iteration_run: estimated pending_size: 36413440 bytes,
> s->threshold_size: 34334680
> migration_iteration_run: estimated pending_size: 29069312 bytes,
> s->threshold_size: 34334680
>
> migration_iteration_run: exact pending_size: 4339167232 bytes, 0,
> 4339167232 <== exact size is calculated once.
> migration_iteration_run: estimated pending_size: 4332871680 bytes,
> s->threshold_size: 35651363
> migration_iteration_run: switching to postcopy: 4332871680, 0,
> 4332871680 <== switch to postcopy with
> must_precopy(=0) <= s->threshold_size
>
> migration_iteration_run: estimated pending_size: 4332892160 bytes,
> s->threshold_size: 35651363
> migration_iteration_run: estimated pending_size: 4323188736 bytes,
> s->threshold_size: 27243109
> migration_iteration_run: estimated pending_size: 4315320320 bytes,
> s->threshold_size: 27243109
> migration_iteration_run: estimated pending_size: 4308221952 bytes,
> s->threshold_size: 37695433
> ===
> * Here, the exact pending_size is calculated only once, because we
> switch to Postcopy mode even when pending_size is > s->threshold_size.
>
> Thank you.
> ---
> - Prasad
>
* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-27 14:35 ` Juraj Marcin
@ 2026-03-30 11:52 ` Prasad Pandit
2026-03-31 12:49 ` Juraj Marcin
2026-04-01 19:11 ` Peter Xu
1 sibling, 1 reply; 67+ messages in thread
From: Prasad Pandit @ 2026-03-30 11:52 UTC (permalink / raw)
To: Juraj Marcin
Cc: Peter Xu, qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
Hello Juraj,
On Fri, 27 Mar 2026 at 20:05, Juraj Marcin <jmarcin@redhat.com> wrote:
> > * What is the 'size' difference between < s->threshold_size Vs <=
> > s->threshold_size? Going through the source IIUC
> > 1) 'pending_size' is measured in Bytes.
> > static void ram_state_pending_exact/_estimate()
> > remaining_size = rs->migration_dirty_pages *
> > TARGET_PAGE_SIZE(=4096 bytes);
> > 100 dirty pages * 4096bytes => 409600 dirty bytes => 409600
> > * 8 => 3,276,800 dirty bits
> >
> > 2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
> > downtime(=300 ms)
> > 100,000,000 bits/s => 100,000 bits/ms
> > 100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
> > 30,000,000 bits / 8 => 3,750,000 Bytes / 300 ms
> > s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
> > transferred in 300ms downtime.
> >
> > * Are we comparing pending_size(=409600 bytes) <=
> > s->threshold_size(=30,000,000 bits)?
>
> While threshold_size is indeed derived from bandwidth, bandwidth is in
> bytes:
>
> current_bytes = migration_transferred_bytes();
> transferred = current_bytes - s->iteration_initial_bytes;
> time_spent = current_time - s->iteration_start_time;
> bandwidth = (double)transferred / time_spent;
>
> Conversion to bits only happens for the mbps statistic:
>
> s->mbps = (((double) transferred * 8.0) /
> ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> >
> > * static void migration_update_counters()
> > transferred = current_bytes - s->iteration_initial_bytes;
> > bandwidth = (double)transferred / time_spent
> > if (switchover_bw) {
> > expected_bw_per_ms = (double)switchover_bw / 1000;
> > } else {
> > expected_bw_per_ms = bandwidth;
> > }
> > => ^^^^^^^ Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?
>
> switchover_bw is expected to be in bytes/sec, however, time_spent is
> already in msec, thus bandwidth is also bytes/msec, the existing code is
> correct.
* I see, though this is not readily clear; it could use improvement.
Maybe we should add an explanation in comments, OR include an example
calculation, OR define helper function(s) to convert bandwidth between
Mb/s and Mb/ms, MBps and Mbps, etc.
> s->mbps = (((double) transferred * 8.0) /
> ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
* If time_spent is in msec, continuing the above 100 Mbps example, we
got to 3,750,000 bytes / 300 ms above. Now to get mbps:

    s->mbps = (3,750,000 * 8.0) / (300 / 1000.0) / 1000.0 / 1000.0
            = 30,000,000 bits / 0.3 s / 1,000,000 => 100 Mbps

* The (time_spent / 1000.0) / 1000.0 / 1000.0 chain is not very
apparent.
> @Peter, not sure if it is necessary, but it could be usefull to mention
> in MigrationParameters docs, that avail-switchover-bandwidth is in
> bytes, not bits?
* I request that we use the 'Mbps' notation for bandwidth at every user
interface, be it a --bandwidth <option>, a definition in a
configuration file, OR documentation. Users should always specify and
read bandwidth in 'Mbps', because that is the notation used everywhere.
Asking users to specify 100,000,000 bps / 8 => 12,500,000 bytes/s
instead of 100 Mbps does not seem right; it is not user friendly.
> >
> > s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> >
> > migration_iteration_run():
> > /* Should we switch to postcopy now? */
> > if (must_precopy <= s->threshold_size &&
> > can_switchover && qatomic_read(&s->start_postcopy)) {
> > if (postcopy_start(s, &local_err)) {
> > migrate_error_propagate(s, error_copy(local_err));
> > error_report_err(local_err);
> > }
> > return MIG_ITERATE_SKIP;
> > }
> > * Here we should check pending_size <= s->threshold_size, because
> > must_precopy is zero(0) when postcopy is enabled. And we switch to
> > postcopy mode even when pending_size > s->threshold_size.
> > I wonder if we really need both 'must_precopy' and 'can_postcopy'
> > variables, they seem to complicate things.
>
> With devices that implement pending method, don't support postcopy, and
> are not yet migrated, must_precopy would not be zero.
* If must_precopy is not zero(0), then pending_data would not be <
threshold_size either, right? What is must_precopy data? IIUC, all
device state is 'must_precopy' data, because we don't send Postcopy
requests for device state data. And can_postcopy data is only RAM
pages.
> Both, must_precopy and can_postcopy are required, that is what allows
> postcopy to switchover early. pending_size is the overall total that
> includes also postcopiable data, hence why it is only used to trigger
> precopy completion.
>
> However, the majority of devices don't implement pending methods (yet)
> and thus are not counted towards the estimate even if they don't support
> postcopy and affect the downtime.
>
> Wondering if VMSD devices could implement some pending estimates based
> on their defined fields, this would also improve not violating the
> downtime requirements.
* Using 'pending_size <= threshold_size' in one place and
'must_precopy <= threshold_size' in another place is confusing and
inconsistent.

* What is threshold_size really? The number of bits/bytes we _can_
transfer within the given downtime, right? If downtime is 300ms, at
100 Mbps, threshold_size is 30Mb (ie. 10Mb/100ms). For 500ms,
threshold_size is 50Mb.

* Then it is easy to understand that when pending_size is <=
threshold_size, we can safely pause the source VM and switch to
Postcopy, because those pending_size bits/bytes can be transferred in
the given downtime. And if pending_size is more, then we _cannot_
pause the source VM, because we cannot transfer more pending data
within the given downtime value.
Thank you.
---
- Prasad
* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-30 11:52 ` Prasad Pandit
@ 2026-03-31 12:49 ` Juraj Marcin
2026-04-06 7:21 ` Prasad Pandit
0 siblings, 1 reply; 67+ messages in thread
From: Juraj Marcin @ 2026-03-31 12:49 UTC (permalink / raw)
To: Prasad Pandit
Cc: Peter Xu, qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
Hi Prasad,
On 2026-03-30 17:22, Prasad Pandit wrote:
> Hello Juraj,
>
> On Fri, 27 Mar 2026 at 20:05, Juraj Marcin <jmarcin@redhat.com> wrote:
> > > * What is the 'size' difference between < s->threshold_size Vs <=
> > > s->threshold_size? Going through the source IIUC
> > > 1) 'pending_size' is measured in Bytes.
> > > static void ram_state_pending_exact/_estimate()
> > > remaining_size = rs->migration_dirty_pages *
> > > TARGET_PAGE_SIZE(=4096 bytes);
> > > 100 dirty pages * 4096bytes => 409600 dirty bytes => 409600
> > > * 8 => 3,276,800 dirty bits
> > >
> > > 2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
> > > downtime(=300 ms)
> > > 100,000,000 bits/s => 100,000 bits/ms
> > > 100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
> > > 30,000,000 bits / 8 => 3,750,000 Bytes / 300 ms
> > > s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
> > > transferred in 300ms downtime.
> > >
> > > * Are we comparing pending_size(=409600 bytes) <=
> > > s->threshold_size(=30,000,000 bits)?
> >
> > While threshold_size is indeed derived from bandwidth, bandwidth is in
> > bytes:
> >
> > current_bytes = migration_transferred_bytes();
> > transferred = current_bytes - s->iteration_initial_bytes;
> > time_spent = current_time - s->iteration_start_time;
> > bandwidth = (double)transferred / time_spent;
> >
> > Conversion to bits only happens for the mbps statistic:
> >
> > s->mbps = (((double) transferred * 8.0) /
> > ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
> >
> > >
> > > * static void migration_update_counters()
> > > transferred = current_bytes - s->iteration_initial_bytes;
> > > bandwidth = (double)transferred / time_spent
> > > if (switchover_bw) {
> > > expected_bw_per_ms = (double)switchover_bw / 1000;
> > > } else {
> > > expected_bw_per_ms = bandwidth;
> > > }
> > > => ^^^^^^^ Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?
> >
> > switchover_bw is expected to be in bytes/sec, however, time_spent is
> > already in msec, thus bandwidth is also bytes/msec, the existing code is
> > correct.
>
> * I see, this is not readily clear though. This needs good
> improvement. Maybe we should add an explanation in comments OR include
> an example calculation OR define helper function(s) to convert
> bandwidth from Mb/s <-> Mb/ms and MBps <-> Mbps etc.
>
> > s->mbps = (((double) transferred * 8.0) /
> > ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> * If time_spent is in msec, continuing the above 100 mbps example, we
> got to 3,750,000 Bytes / 300 ms above. Now to get mbps
>
> s->mbps = (3,750,000 * 8.0) / 300 (time_spent) bits/ms =>
> 30,000,000 bits / 300ms
> s->mbps = 30,000,000 bits * 1000ms (=1sec) / 300ms => 100,000,000 mbps.
>
> * (time_spent / 1000.0) / 1000.0 / 1000.0 is not very apparent.
>
> > @Peter, not sure if it is necessary, but it could be usefull to mention
> > in MigrationParameters docs, that avail-switchover-bandwidth is in
> > bytes, not bits?
>
> * I request that we use 'Mbps' notation for bandwidth at every user
> interface, be it --bandwidth <option> OR a definition in a
> configuration file OR documentation. Users should always specify and
> read bandwidth in 'Mbps'. Because that is the notation used
> everywhere. Asking users to specify 100,000,000 bps / 8 => 12,500,000
> Bytes/s instead of 100Mbps does not seem right, is not user friendly.
The QAPI schema defines the type of 'avail-switchover-bandwidth' and
the other '*-bandwidth' properties as 'size', which allows users to use
a size suffix, for example '100M'. The QAPI parser then parses this
value to 100 * 1000 * 1000 bytes.

Changing from bytes to bits is an API-breaking change; I don't think it
is justified.
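Since the parameter is in bytes/sec, a caller quoting a link speed in
Mbit/s has to convert first. A tiny helper (hypothetical, not part of
QEMU) makes the factor explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a link speed quoted in Mbit/s into the bytes/sec value that
 * avail-switchover-bandwidth expects (decimal mega; 8 bits per byte). */
static uint64_t mbps_to_bytes_per_sec(uint64_t mbit_per_sec)
{
    return mbit_per_sec * 1000ULL * 1000ULL / 8ULL;
}
```

For example, a 100 Mbit/s link maps to 12,500,000 bytes/sec.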
>
> > >
> > > s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> > >
> > > migration_iteration_run():
> > > /* Should we switch to postcopy now? */
> > > if (must_precopy <= s->threshold_size &&
> > > can_switchover && qatomic_read(&s->start_postcopy)) {
> > > if (postcopy_start(s, &local_err)) {
> > > migrate_error_propagate(s, error_copy(local_err));
> > > error_report_err(local_err);
> > > }
> > > return MIG_ITERATE_SKIP;
> > > }
> > > * Here we should check pending_size <= s->threshold_size, because
> > > must_precopy is zero(0) when postcopy is enabled. And we switch to
> > > postcopy mode even when pending_size > s->threshold_size.
> > > I wonder if we really need both 'must_precopy' and 'can_postcopy'
> > > variables, they seem to complicate things.
> >
> > With devices that implement pending method, don't support postcopy, and
> > are not yet migrated, must_precopy would not be zero.
>
> * If must_precopy is not zero(0), then pending_data would not be <
> threshold_size either, right? What is must_precopy data? IIUC, all
> device state is 'must_precopy' data, because we don't send Postcopy
> requests for device state data. And can_postcopy data is only RAM
> pages.
>
> > Both, must_precopy and can_postcopy are required, that is what allows
> > postcopy to switchover early. pending_size is the overall total that
> > includes also postcopiable data, hence why it is only used to trigger
> > precopy completion.
> >
> > However, the majority of devices don't implement pending methods (yet)
> > and thus are not counted towards the estimate even if they don't support
> > postcopy and affect the downtime.
> >
> > Wondering if VMSD devices could implement some pending estimates based
> > on their defined fields, this would also improve not violating the
> > downtime requirements.
>
> * Using 'pending_size <= threshold_size' in one place and using
> 'must_precopy <= threshold_size' in another place is confusing and
> inconsistent.
>
> * What is threshold_size really? Number of bits/bytes we _can_
> transfer within the given downtime, right? if downtime is 300ms, at
> 100 mbps, threshold_size is 30Mb (ie. 10Mb/100ms). For 500ms,
> threshold_size is 50Mb.
Yes, that is right.
>
> * Then it is easy to understand that when pending_size is <=
> threshold_size, then we can easily pause the source VM and switch to
> Postcopy, becasue those pending_size bits/bytes can be transferred in
> the given downtime. And if pending_size is more, then we _can not_
> pause the source VM, because we can not transfer more pending data
> within the given downtime value.
Devices report two pending values, 'must_precopy' and 'can_postcopy'.
'must_precopy' is the number of bytes that must be migrated in precopy
(or switchover) and cannot be postcopied. 'can_postcopy' is the number
of bytes that can be migrated either in precopy or postcopy; those can
be RAM pages (with postcopy-ram enabled), or dirty bitmaps, or any
device that would support postcopy.

If switching to postcopy is allowed, we only need to be able to
transfer 'must_precopy' bytes in switchover. The rest of the pending
data can be migrated in postcopy ('can_postcopy' bytes). Thus, we only
compare 'must_precopy' when deciding whether to switch to postcopy.

However, if postcopy isn't allowed and we are deciding whether to stop
the source machine and complete the migration, we need to migrate
everything in switchover, including the data that could be postcopied.
Thus, we need to compare 'pending_size = must_precopy + can_postcopy'
to decide if we can complete the migration in the specified downtime.
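The two decisions can be sketched like this (a simplification, not
QEMU's actual code; the real checks also gate on can_switchover and
s->start_postcopy):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Postcopy switchover: only the non-postcopiable part must fit into
 * the downtime window. */
static bool can_switch_to_postcopy(uint64_t must_precopy,
                                   uint64_t threshold)
{
    return must_precopy <= threshold;
}

/* Precopy completion: everything must fit, since no postcopy phase
 * will follow to pull the rest. */
static bool can_complete_precopy(uint64_t must_precopy,
                                 uint64_t can_postcopy,
                                 uint64_t threshold)
{
    return must_precopy + can_postcopy <= threshold;
}
```

With numbers like those in the log quoted earlier (must_precopy = 0,
can_postcopy ~4.3 GB, threshold ~35 MB), postcopy switchover is allowed
while precopy completion is not.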
>
>
> Thank you.
> ---
> - Prasad
>
* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-31 12:49 ` Juraj Marcin
@ 2026-04-06 7:21 ` Prasad Pandit
0 siblings, 0 replies; 67+ messages in thread
From: Prasad Pandit @ 2026-04-06 7:21 UTC (permalink / raw)
To: Juraj Marcin
Cc: Peter Xu, qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
Hi,
On Tue, 31 Mar 2026 at 18:19, Juraj Marcin <jmarcin@redhat.com> wrote:
> QAPI schema defines the type of 'avail-switchover-bandwidth' and other
> '*-bandwidth' properties as size, which allows users to use size suffix,
> for example '100M'. QAPI parser then parses this value to
> 100 * 1000 * 1000 bytes.
* That's nice, and common with other tools: dc(1), wget(1), curl(1).
It still doesn't save the user the trouble of converting 100 Mbps ->
12.5 MBps. If we check wget(1) and curl(1), their '--limit-rate' option
accepts values as bytes/sec, but that is a usage limit the user wants
to impose; it does not entail a bits <-> bytes conversion. If the
intended use of the '--available-switchover-bandwidth' option is
analogous to the '--limit-rate' option, then naming it
--available-*bandwidth is confusing OR wrong.

* The contention here is the deviation from the norm of specifying
bandwidth in 'bits-per-second'. To accommodate it we add conversion
arithmetic, documentation (or miss it), etc. Above all, it breaks user
intuition.
> Changing from bytes to bits is API breaking change, I don't think it is justified.
* The sanctity of APIs is important, but that should not prevent us
from improving those very APIs.
> Devices report two pending values, 'must_precopy' and 'can_postcopy'.
> 'must_precopy' is number of bytes that must be migrated in precopy (or
> switchover) and cannot be postcopied. 'can_postcopy' is number of bytes
> that can be migrated either in precopy or postcopy, those can be RAM
> pages (with postcopy-ram enabled) or dirty bitmaps or any device that
> would support postcopy.
>
> If switching to postcopy is allowed, we only need to be able to transfer
> 'must_precopy' bytes in switchover. Rest of the pending data can be
> migrated in postcopy ('can_postcopy' bytes). Thus, we only compare
> 'must_precopy' when deciding whether to switch to postcopy.
>
> However, if postcopy isn't allowed, and we are deciding whether to stop
> the source machine and complete the migration, we need to migrate
> everything in switchover, including the data that could be postcopied.
> Thus, we need to compare 'pending_size = must_precopy + can_postcopy' to
> decide if we can complete the migration in the specified downtime.
* Right. Is it fair to say:
/* Return pending data size to decide if we can pause/stop the source VM
 * and start the VM on the destination side */
uint64_t qemu_savevm_state_pending_estimate/exact( ... )
{
    uint64_t must_precopy = 0, can_postcopy = 0;
    ...
    if (migrate_postcopy()) {
        return must_precopy;
    }
    return must_precopy + can_postcopy;
}
pending_size = qemu_savevm_state_pending_estimate/exact( ... );
Thank you.
---
- Prasad
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-27 14:35 ` Juraj Marcin
2026-03-30 11:52 ` Prasad Pandit
@ 2026-04-01 19:11 ` Peter Xu
1 sibling, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-01 19:11 UTC (permalink / raw)
To: Juraj Marcin
Cc: Prasad Pandit, qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
On Fri, Mar 27, 2026 at 03:35:35PM +0100, Juraj Marcin wrote:
> @Peter, not sure if it is necessary, but it could be useful to mention
> in MigrationParameters docs, that avail-switchover-bandwidth is in
> bytes, not bits?
I agree, I will add a patch to update the doc. Thanks for jumping in,
Juraj; I appreciate that.
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
2026-03-20 12:26 ` Prasad Pandit
@ 2026-03-27 15:05 ` Juraj Marcin
1 sibling, 0 replies; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 15:05 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
qemu-stable
On 2026-03-19 19:12, Peter Xu wrote:
> When QEMU queried the estimated version of pending data and thinks it's
> ready to converge, it'll send another accurate query to make sure of it.
> It is needed to make sure we collect the latest reports and that equation
> still holds true.
>
> However we missed one tiny little difference here on "<" v.s. "<=" when
> comparing pending_size (A) to threshold_size (B)..
>
> QEMU src only re-query if A<B, but will kickoff switchover if A<=B.
>
> I think it means it is possible to happen if A (as an estimate only so far)
> accidentally equals to B, then re-query won't happen and switchover will
> proceed without considering new dirtied data.
>
> It turns out it was an accident in my commit 7aaa1fc072 when refactoring
> the code around. Fix this by using the same equation in both places.
>
> Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 5c9aaa6e58..dfc60372cf 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy started, so ESTIMATE should always match with EXACT
> * during postcopy phase.
> */
> - if (pending_size < s->threshold_size) {
> + if (pending_size <= s->threshold_size) {
> qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
> pending_size = must_precopy + can_postcopy;
> trace_migrate_pending_exact(pending_size, must_precopy,
> --
> 2.50.1
>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-19 23:26 ` Peter Xu
2026-03-20 6:54 ` Markus Armbruster
2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
` (9 subsequent siblings)
11 siblings, 2 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Libvirt Mailing List
This stats is only about RAM, so make it accurate. This paves way for
statistics for all devices.
Note that even if this is part of qapi/, this should not be part of ABI of
at least query-migrate, because the structure is not changed, and this
stats is always reported only under the "ram" section.
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Libvirt Mailing List <libvir-list@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
docs/about/removed-features.rst | 2 +-
qapi/migration.json | 8 ++++----
migration/migration-stats.h | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index 6f4447993c..7c4f4325f7 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -699,7 +699,7 @@ was superseded by ``sections``.
``query-migrate`` return value member ``skipped`` (removed in 9.1)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
+Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
for more than 10 years. Removed with no replacement.
``migrate`` command option ``inc`` (removed in 9.1)
diff --git a/qapi/migration.json b/qapi/migration.json
index 7134d4ce47..cfc6ccee26 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -12,7 +12,7 @@
{ 'include': 'sockets.json' }
##
-# @MigrationStats:
+# @MigrationRAMStats:
#
# Detailed migration status.
#
@@ -64,7 +64,7 @@
#
# Since: 0.14
##
-{ 'struct': 'MigrationStats',
+{ 'struct': 'MigrationRAMStats',
'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
'duplicate': 'int',
'normal': 'int',
@@ -209,7 +209,7 @@
# If this field is not returned, no migration process has been
# initiated
#
-# @ram: `MigrationStats` containing detailed migration status, only
+# @ram: `MigrationRAMStats` containing detailed migration status, only
# returned if status is 'active' or 'completed'(since 1.2)
#
# @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
@@ -309,7 +309,7 @@
# Since: 0.14
##
{ 'struct': 'MigrationInfo',
- 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
+ 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
'*vfio': 'VfioStats',
'*xbzrle-cache': 'XBZRLECacheStats',
'*total-time': 'int',
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index c0f50144c9..1153520f7a 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -27,7 +27,7 @@
/*
* These are the ram migration statistic counters. It is loosely
- * based on MigrationStats.
+ * based on MigrationRAMStats.
*/
typedef struct {
/*
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
@ 2026-03-19 23:26 ` Peter Xu
2026-03-20 6:54 ` Markus Armbruster
1 sibling, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:26 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Libvirt Mailing List
On Thu, Mar 19, 2026 at 07:12:52PM -0400, Peter Xu wrote:
> This stats is only about RAM, so make it accurate. This paves way for
> statistics for all devices.
>
> Note that even if this is part of qapi/, this should not be part of ABI of
> at least query-migrate, because the structure is not changed, and this
> stats is always reported only under the "ram" section.
>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Libvirt Mailing List <libvir-list@redhat.com>
I got the address wrong.. Will update when I repost. Copying the correct
one.
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/about/removed-features.rst | 2 +-
> qapi/migration.json | 8 ++++----
> migration/migration-stats.h | 2 +-
> 3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
> index 6f4447993c..7c4f4325f7 100644
> --- a/docs/about/removed-features.rst
> +++ b/docs/about/removed-features.rst
> @@ -699,7 +699,7 @@ was superseded by ``sections``.
> ``query-migrate`` return value member ``skipped`` (removed in 9.1)
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
>
> -Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
> +Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
> for more than 10 years. Removed with no replacement.
>
> ``migrate`` command option ``inc`` (removed in 9.1)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 7134d4ce47..cfc6ccee26 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -12,7 +12,7 @@
> { 'include': 'sockets.json' }
>
> ##
> -# @MigrationStats:
> +# @MigrationRAMStats:
> #
> # Detailed migration status.
> #
> @@ -64,7 +64,7 @@
> #
> # Since: 0.14
> ##
> -{ 'struct': 'MigrationStats',
> +{ 'struct': 'MigrationRAMStats',
> 'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
> 'duplicate': 'int',
> 'normal': 'int',
> @@ -209,7 +209,7 @@
> # If this field is not returned, no migration process has been
> # initiated
> #
> -# @ram: `MigrationStats` containing detailed migration status, only
> +# @ram: `MigrationRAMStats` containing detailed migration status, only
> # returned if status is 'active' or 'completed'(since 1.2)
> #
> # @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
> @@ -309,7 +309,7 @@
> # Since: 0.14
> ##
> { 'struct': 'MigrationInfo',
> - 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> + 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
> '*vfio': 'VfioStats',
> '*xbzrle-cache': 'XBZRLECacheStats',
> '*total-time': 'int',
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index c0f50144c9..1153520f7a 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -27,7 +27,7 @@
>
> /*
> * These are the ram migration statistic counters. It is loosely
> - * based on MigrationStats.
> + * based on MigrationRAMStats.
> */
> typedef struct {
> /*
> --
> 2.50.1
>
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
2026-03-19 23:26 ` Peter Xu
@ 2026-03-20 6:54 ` Markus Armbruster
2026-04-01 19:38 ` Peter Xu
2026-04-01 19:47 ` Peter Xu
1 sibling, 2 replies; 67+ messages in thread
From: Markus Armbruster @ 2026-03-20 6:54 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Avihai Horon, Cédric Le Goater, devel
Cc: correct libvirt list address
Peter Xu <peterx@redhat.com> writes:
> This stats is only about RAM, so make it accurate. This paves way for
> statistics for all devices.
>
> Note that even if this is part of qapi/, this should not be part of ABI of
docs/devel/qapi-code-gen.rst section "Compatibility considerations":
Since type names are not visible in the Client JSON Protocol, types
may be freely renamed. Even certain refactorings are invisible, such
as splitting members from one type into a common base type.
So, s/should not be/is not/.
> at least query-migrate, because the structure is not changed, and this
> stats is always reported only under the "ram" section.
Why "at least query-migrate"? Do you have other interfaces in mind, or
are you just hedging?
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Libvirt Mailing List <libvir-list@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/about/removed-features.rst | 2 +-
> qapi/migration.json | 8 ++++----
> migration/migration-stats.h | 2 +-
> 3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
> index 6f4447993c..7c4f4325f7 100644
> --- a/docs/about/removed-features.rst
> +++ b/docs/about/removed-features.rst
> @@ -699,7 +699,7 @@ was superseded by ``sections``.
> ``query-migrate`` return value member ``skipped`` (removed in 9.1)
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
>
> -Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
> +Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
> for more than 10 years. Removed with no replacement.
Could simply use "of the return value". But this is fine, too.
>
> ``migrate`` command option ``inc`` (removed in 9.1)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 7134d4ce47..cfc6ccee26 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -12,7 +12,7 @@
> { 'include': 'sockets.json' }
>
> ##
> -# @MigrationStats:
> +# @MigrationRAMStats:
> #
> # Detailed migration status.
> #
> @@ -64,7 +64,7 @@
> #
> # Since: 0.14
> ##
> -{ 'struct': 'MigrationStats',
> +{ 'struct': 'MigrationRAMStats',
> 'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
> 'duplicate': 'int',
> 'normal': 'int',
> @@ -209,7 +209,7 @@
> # If this field is not returned, no migration process has been
> # initiated
> #
> -# @ram: `MigrationStats` containing detailed migration status, only
> +# @ram: `MigrationRAMStats` containing detailed migration status, only
> # returned if status is 'active' or 'completed'(since 1.2)
This gets rendered like
* ram (MigrationStats, *optional*) -- MigrationStats
containing detailed migration status, only returned if status
is 'active' or 'completed'(since 1.2)
Recommend to rephrase to avoid the repetition. Maybe
# @ram: Detailed migration RAM status, only returned if status is
# 'active' or 'completed' (since 1.2)
Also consider replacing "status" by "statistics".
> #
> # @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
> @@ -309,7 +309,7 @@
> # Since: 0.14
> ##
> { 'struct': 'MigrationInfo',
> - 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> + 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
> '*vfio': 'VfioStats',
> '*xbzrle-cache': 'XBZRLECacheStats',
> '*total-time': 'int',
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index c0f50144c9..1153520f7a 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -27,7 +27,7 @@
>
> /*
> * These are the ram migration statistic counters. It is loosely
> - * based on MigrationStats.
> + * based on MigrationRAMStats.
> */
> typedef struct {
> /*
Only minor issues, and none added by this patch, so
Acked-by: Markus Armbruster <armbru@redhat.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-03-20 6:54 ` Markus Armbruster
@ 2026-04-01 19:38 ` Peter Xu
2026-04-01 19:47 ` Peter Xu
1 sibling, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-01 19:38 UTC (permalink / raw)
To: Markus Armbruster
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Avihai Horon, Cédric Le Goater, devel
On Fri, Mar 20, 2026 at 07:54:38AM +0100, Markus Armbruster wrote:
> Cc: correct libvirt list address
>
> Peter Xu <peterx@redhat.com> writes:
>
> > This stats is only about RAM, so make it accurate. This paves way for
> > statistics for all devices.
> >
> > Note that even if this is part of qapi/, this should not be part of ABI of
>
> docs/devel/qapi-code-gen.rst section "Compatibility considerations":
>
> Since type names are not visible in the Client JSON Protocol, types
> may be freely renamed. Even certain refactorings are invisible, such
> as splitting members from one type into a common base type.
>
> So, s/should not be/is not/.
>
> > at least query-migrate, because the structure is not changed, and this
> > stats is always reported only under the "ram" section.
>
> Why "at least query-migrate"? Do you have other interfaces in mind, or
> are you just hedging?
I wanted to say the QMP command that consumes this structure will not be
affected. Happy to learn that it's explicitly described in qapi document,
thanks for the pointer! I'll also add this reference in the commit
message.
>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>
> > Cc: Markus Armbruster <armbru@redhat.com>
> > Cc: Libvirt Mailing List <libvir-list@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > docs/about/removed-features.rst | 2 +-
> > qapi/migration.json | 8 ++++----
> > migration/migration-stats.h | 2 +-
> > 3 files changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
> > index 6f4447993c..7c4f4325f7 100644
> > --- a/docs/about/removed-features.rst
> > +++ b/docs/about/removed-features.rst
> > @@ -699,7 +699,7 @@ was superseded by ``sections``.
> > ``query-migrate`` return value member ``skipped`` (removed in 9.1)
> > ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
> >
> > -Member ``skipped`` of the ``MigrationStats`` struct hasn't been used
> > +Member ``skipped`` of the ``MigrationRAMStats`` struct hasn't been used
> > for more than 10 years. Removed with no replacement.
>
> Could simply use "of the return value". But this is fine, too.
Let me keep this as-is then; the plan is to keep this patch a simple
rename. The current name is not accurate anymore (e.g. dirty-sync-count
should be a global statistic rather than RAM-only), and the rename also
makes that name available for system-wide statistics.
>
> >
> > ``migrate`` command option ``inc`` (removed in 9.1)
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 7134d4ce47..cfc6ccee26 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -12,7 +12,7 @@
> > { 'include': 'sockets.json' }
> >
> > ##
> > -# @MigrationStats:
> > +# @MigrationRAMStats:
> > #
> > # Detailed migration status.
> > #
> > @@ -64,7 +64,7 @@
> > #
> > # Since: 0.14
> > ##
> > -{ 'struct': 'MigrationStats',
> > +{ 'struct': 'MigrationRAMStats',
> > 'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
> > 'duplicate': 'int',
> > 'normal': 'int',
> > @@ -209,7 +209,7 @@
> > # If this field is not returned, no migration process has been
> > # initiated
> > #
> > -# @ram: `MigrationStats` containing detailed migration status, only
> > +# @ram: `MigrationRAMStats` containing detailed migration status, only
> > # returned if status is 'active' or 'completed'(since 1.2)
>
> This gets rendered like
>
> * ram (MigrationStats, *optional*) -- MigrationStats
> containing detailed migration status, only returned if status
> is 'active' or 'completed'(since 1.2)
>
> Recommend to rephrase to avoid the repetition. Maybe
>
> # @ram: Detailed migration RAM status, only returned if status is
> # 'active' or 'completed' (since 1.2)
>
> Also consider replacing "status" by "statistics".
Sure, I'll touch it up while at this.
>
> > #
> > # @xbzrle-cache: `XBZRLECacheStats` containing detailed XBZRLE
> > @@ -309,7 +309,7 @@
> > # Since: 0.14
> > ##
> > { 'struct': 'MigrationInfo',
> > - 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> > + 'data': {'*status': 'MigrationStatus', '*ram': 'MigrationRAMStats',
> > '*vfio': 'VfioStats',
> > '*xbzrle-cache': 'XBZRLECacheStats',
> > '*total-time': 'int',
> > diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> > index c0f50144c9..1153520f7a 100644
> > --- a/migration/migration-stats.h
> > +++ b/migration/migration-stats.h
> > @@ -27,7 +27,7 @@
> >
> > /*
> > * These are the ram migration statistic counters. It is loosely
> > - * based on MigrationStats.
> > + * based on MigrationRAMStats.
> > */
> > typedef struct {
> > /*
>
> Only minor issues, and none added by this patch, so
> Acked-by: Markus Armbruster <armbru@redhat.com>
It took me a while to be able to go back to this series, thanks for the
super fast feedback!
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats
2026-03-20 6:54 ` Markus Armbruster
2026-04-01 19:38 ` Peter Xu
@ 2026-04-01 19:47 ` Peter Xu
1 sibling, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-01 19:47 UTC (permalink / raw)
To: Markus Armbruster
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Avihai Horon, Cédric Le Goater, devel
On Fri, Mar 20, 2026 at 07:54:38AM +0100, Markus Armbruster wrote:
> Recommend to rephrase to avoid the repetition. Maybe
>
> # @ram: Detailed migration RAM status, only returned if status is
> # 'active' or 'completed' (since 1.2)
I just noticed the previous documentation was also wrong: it is too
ancient (dating back to 1.2), and we have since introduced many new
statuses that also report RAM statistics. I changed it to the below:
# @ram: Detailed migration RAM statistics, only returned if migration
# is in progress or completed (since 1.2)
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-25 14:10 ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
` (8 subsequent siblings)
11 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
During precopy phase, VFIO maintains two counters for init/dirty data
tracking for query estimations.
VFIO fetches data during precopy by reading from the VFIO fd, after
fetching it'll deduct the read size.
Here since the fd's size can dynamically change, I think it means VFIO may
read more than what it "thought" were there for fetching.
I highly suspect it's also relevant to a weird case in the function of
vfio_update_estimated_pending_data(), where when VFIO reads 0 from the FD
it will _reset_ the two counters, instead of asserting both of them being
zeros, which looks pretty hackish.
Just guarantee it from userspace level that VFIO won't read more than what
it expects for now.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/vfio/migration.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 83327b6573..851ea783f3 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -357,12 +357,18 @@ static int vfio_query_precopy_size(VFIOMigration *migration)
}
/* Returns the size of saved data on success and -errno on error */
-static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration,
+ bool precopy)
{
- ssize_t data_size;
+ ssize_t data_size = migration->data_buffer_size;
+
+ if (precopy) {
+ /* Limit the buffer size to make sure cached stats don't overflow */
+ data_size = MIN(data_size, migration->precopy_init_size +
+ migration->precopy_dirty_size);
+ }
- data_size = read(migration->data_fd, migration->data_buffer,
- migration->data_buffer_size);
+ data_size = read(migration->data_fd, migration->data_buffer, data_size);
if (data_size < 0) {
/*
* Pre-copy emptied all the device state for now. For more information,
@@ -623,7 +629,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
migration->event_save_iterate_started = true;
}
- data_size = vfio_save_block(f, migration);
+ data_size = vfio_save_block(f, migration, true);
if (data_size < 0) {
return data_size;
}
@@ -667,7 +673,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
}
do {
- data_size = vfio_save_block(f, vbasedev->migration);
+ data_size = vfio_save_block(f, vbasedev->migration, false);
if (data_size < 0) {
return data_size;
}
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read
2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
@ 2026-03-25 14:10 ` Avihai Horon
2026-04-01 20:36 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-03-25 14:10 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
Hi Peter,
Thanks for sending this series.
On 3/20/2026 1:12 AM, Peter Xu wrote:
> During precopy phase, VFIO maintains two counters for init/dirty data
> tracking for query estimations.
>
> VFIO fetches data during precopy by reading from the VFIO fd, after
> fetching it'll deduct the read size.
>
> Here since the fd's size can dynamically change, I think it means VFIO may
> read more than what it "thought" were there for fetching.
>
> I highly suspect it's also relevant to a weird case in the function of
> vfio_update_estimated_pending_data(), where when VFIO reads 0 from the FD
> it will _reset_ the two counters, instead of asserting both of them being
> zeros, which looks pretty hackish.
>
> Just guarantee it from userspace level that VFIO won't read more than what
> it expects for now.
The VFIO_MIG_GET_PRECOPY_INFO ioctl returns an estimation of the data
size currently available for reading. So, even if the ioctl returns X
bytes, it may be that there are more than X bytes to read or less than X
bytes.
The code was written in a flexible way to handle such discrepancies.
Because we are dealing with an estimation, I don't think we can assert
that the counters are zero, and I don't think reading only up to the
cached size gives us any benefit:
If the estimation is lower than actual available data, we are just
deferring sending of the remaining data to a later stage.
If the estimation is higher than actual available data, we may still
read() zero when the cached values are not zero.
I think we should keep the code as is.
Does that make sense?
Thanks.
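To make the over-read concern concrete, here is a simplified model of
deducting a read() result from the cached estimates (field names and
the dirty-first order are illustrative assumptions, not the actual
QEMU code): without clamping, a read larger than the estimates would
underflow the unsigned counters.

```c
#include <stdint.h>

/* Simplified, illustrative model of deducting a completed read from
 * cached init/dirty estimates; not the real
 * vfio_update_estimated_pending_data(). */
struct counters {
    uint64_t init_size;
    uint64_t dirty_size;
};

/* Deduct read_size from the cached estimates, dirty first, clamping
 * each subtraction so an over-long read cannot underflow a counter. */
static void deduct_read(struct counters *c, uint64_t read_size)
{
    uint64_t from_dirty = read_size < c->dirty_size ? read_size
                                                    : c->dirty_size;
    c->dirty_size -= from_dirty;
    read_size -= from_dirty;
    c->init_size -= read_size < c->init_size ? read_size : c->init_size;
}
```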
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/vfio/migration.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 83327b6573..851ea783f3 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -357,12 +357,18 @@ static int vfio_query_precopy_size(VFIOMigration *migration)
> }
>
> /* Returns the size of saved data on success and -errno on error */
> -static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> +static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration,
> + bool precopy)
> {
> - ssize_t data_size;
> + ssize_t data_size = migration->data_buffer_size;
> +
> + if (precopy) {
> + /* Limit the buffer size to make sure cached stats don't overflow */
> + data_size = MIN(data_size, migration->precopy_init_size +
> + migration->precopy_dirty_size);
> + }
>
> - data_size = read(migration->data_fd, migration->data_buffer,
> - migration->data_buffer_size);
> + data_size = read(migration->data_fd, migration->data_buffer, data_size);
> if (data_size < 0) {
> /*
> * Pre-copy emptied all the device state for now. For more information,
> @@ -623,7 +629,7 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
> migration->event_save_iterate_started = true;
> }
>
> - data_size = vfio_save_block(f, migration);
> + data_size = vfio_save_block(f, migration, true);
> if (data_size < 0) {
> return data_size;
> }
> @@ -667,7 +673,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> }
>
> do {
> - data_size = vfio_save_block(f, vbasedev->migration);
> + data_size = vfio_save_block(f, vbasedev->migration, false);
> if (data_size < 0) {
> return data_size;
> }
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read
2026-03-25 14:10 ` Avihai Horon
@ 2026-04-01 20:36 ` Peter Xu
2026-04-06 11:21 ` Avihai Horon
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-01 20:36 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Wed, Mar 25, 2026 at 04:10:14PM +0200, Avihai Horon wrote:
> Hi Peter,
Avihai,
>
> Thanks for sending this series.
Thanks for taking a look.
>
> On 3/20/2026 1:12 AM, Peter Xu wrote:
> > During precopy phase, VFIO maintains two counters for init/dirty data
> > tracking for query estimations.
> >
> > VFIO fetches data during precopy by reading from the VFIO fd, after
> > fetching it'll deduct the read size.
> >
> > Here since the fd's size can dynamically change, I think it means VFIO may
> > read more than what it "thought" were there for fetching.
> >
> > I highly suspect it's also relevant to a weird case in the function of
> > vfio_update_estimated_pending_data(), where when VFIO reads 0 from the FD
> > it will _reset_ the two counters, instead of asserting both of them being
> > zeros, which looks pretty hackish.
> >
> > Just guarantee it from userspace level that VFIO won't read more than what
> > it expects for now.
>
> The VFIO_MIG_GET_PRECOPY_INFO ioctl returns an estimation of the data size
> currently available for reading. So, even if the ioctl returns X bytes, it
> may be that there are more than X bytes to read or less than X bytes.
> The code was written in a flexible way to handle such discrepancies.
>
> Because we are dealing with an estimation, I don't think we can assert that
> the counters are zero, and I don't think reading only up to the cached size
> gives us any benefit:
> If the estimation is lower than actual available data, we are just deferring
> sending of the remaining data to a later stage.
Since we'll introduce a cached size, having the read() happen only with
the size reported still makes sense to me.
We're not deferring that much: when dirty data reaches zero, we'll
re-sync with everything including VFIO's VFIO_MIG_GET_PRECOPY_INFO. So
it's just splitting one last-phase read() into two smaller read()s. To
me, that still sounds OK if it makes sure the counters won't overflow.
> If the estimation is higher than actual available data, we may still read()
> zero when the cached values are not zero.
>
> I think we should keep the code as is.
>
> Does that make sense?
I can understand what got reported in VFIO_MIG_GET_PRECOPY_INFO may not be
the total size of dirty data, but what the userapp can read. That part is
fine.
Now, do you mean the size reported could shrink as well? Could you explain
why, and when, dirtied data size can shrink?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read
2026-04-01 20:36 ` Peter Xu
@ 2026-04-06 11:21 ` Avihai Horon
2026-04-07 15:18 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-04-06 11:21 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On 4/1/2026 11:36 PM, Peter Xu wrote:
>
>
> On Wed, Mar 25, 2026 at 04:10:14PM +0200, Avihai Horon wrote:
>> Hi Peter,
> Avihai,
>
>> Thanks for sending this series.
> Thanks for taking a look.
>
>> On 3/20/2026 1:12 AM, Peter Xu wrote:
>>>
>>>
>>> During precopy phase, VFIO maintains two counters for init/dirty data
>>> tracking for query estimations.
>>>
>>> VFIO fetches data during precopy by reading from the VFIO fd, after
>>> fetching it'll deduct the read size.
>>>
>>> Here since the fd's size can dynamically change, I think it means VFIO may
>>> read more than what it "thought" were there for fetching.
>>>
>>> I highly suspect it's also relevant to a weird case in the function of
>>> vfio_update_estimated_pending_data(), where when VFIO reads 0 from the FD
>>> it will _reset_ the two counters, instead of asserting both of them being
>>> zeros, which looks pretty hackish.
>>>
>>> Just guarantee it from userspace level that VFIO won't read more than what
>>> it expects for now.
>> The VFIO_MIG_GET_PRECOPY_INFO ioctl returns an estimation of the data size
>> currently available for reading. So, even if the ioctl returns X bytes, it
>> may be that there are more than X bytes to read or less than X bytes.
>> The code was written in a flexible way to handle such discrepancies.
>>
>> Because we are dealing with an estimation, I don't think we can assert that
>> the counters are zero, and I don't think reading only up to the cached size
>> gives us any benefit:
>> If the estimation is lower than actual available data, we are just deferring
>> sending of the remaining data to a later stage.
> Since we'll introduce a cached size, having the read() only happen up to
> the size reported still makes sense to me.
>
> We're not deferring that much to later: when dirty data reaches zero, we'll
> re-sync everything, including VFIO's VFIO_MIG_GET_PRECOPY_INFO. So it's
> just splitting one last-phase read() into two smaller read()s. To me, it
> still sounds OK if, with that, we can make sure the counter won't overflow.
The counter doesn't overflow today as well (thanks to the MIN calculation).
>
>> If the estimation is higher than actual available data, we may still read()
>> zero when the cached values are not zero.
>>
>> I think we should keep the code as is.
>>
>> Does that make sense?
> I can understand what got reported in VFIO_MIG_GET_PRECOPY_INFO may not be
> the total size of dirty data, but what the userapp can read. That part is
> fine.
>
> Now, do you mean the size reported could shrink as well?
Yes.
> Could you explain
> why, and when, dirtied data size can shrink?
First of all, the sizes of VFIO_MIG_GET_PRECOPY_INFO are defined as an
estimation, so anything can be reported there.
But specifically for mlx5, VFIO_MIG_GET_PRECOPY_INFO involves two steps,
query and save: first, the driver queries the device for the
*expected* amount of data (that's the returned init/dirty sizes), and
then the driver asynchronously saves the data.
The query returns the *expected* amount of data, but the actual data
returned in the save may be smaller.
Regarding VFIO_DEVICE_FEATURE_MIG_DATA_SIZE, I don't have a concrete
example of data "shrink".
But I can think of a device whose data is all precopy-able, where
VFIO_DEVICE_FEATURE_MIG_DATA_SIZE may report X bytes remaining while
the actual data is smaller than X, simply because the uAPI defines it
as an estimate.
What I am trying to say is that the VFIO_MIG_GET_PRECOPY_INFO and
VFIO_DEVICE_FEATURE_MIG_DATA_SIZE sizes are estimates which are good
enough for telling QEMU how much data is remaining, but shouldn't be
used to make precise calculations of how much to read(). At least I
don't see the benefit of it.
Thanks.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read
2026-04-06 11:21 ` Avihai Horon
@ 2026-04-07 15:18 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-07 15:18 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Mon, Apr 06, 2026 at 02:21:40PM +0300, Avihai Horon wrote:
>
> On 4/1/2026 11:36 PM, Peter Xu wrote:
> >
> >
> > On Wed, Mar 25, 2026 at 04:10:14PM +0200, Avihai Horon wrote:
> > > Hi Peter,
> > Avihai,
> >
> > > Thanks for sending this series.
> > Thanks for taking a look.
> >
> > > On 3/20/2026 1:12 AM, Peter Xu wrote:
> > > >
> > > >
> > > > During precopy phase, VFIO maintains two counters for init/dirty data
> > > > tracking for query estimations.
> > > >
> > > > VFIO fetches data during precopy by reading from the VFIO fd, after
> > > > fetching it'll deduct the read size.
> > > >
> > > > Here since the fd's size can dynamically change, I think it means VFIO may
> > > > read more than what it "thought" were there for fetching.
> > > >
> > > > I highly suspect it's also relevant to a weird case in the function of
> > > > vfio_update_estimated_pending_data(), where when VFIO reads 0 from the FD
> > > > it will _reset_ the two counters, instead of asserting both of them being
> > > > zeros, which looks pretty hackish.
> > > >
> > > > Just guarantee it from userspace level that VFIO won't read more than what
> > > > it expects for now.
> > > The VFIO_MIG_GET_PRECOPY_INFO ioctl returns an estimation of the data size
> > > currently available for reading. So, even if the ioctl returns X bytes, it
> > > may be that there are more than X bytes to read or less than X bytes.
> > > The code was written in a flexible way to handle such discrepancies.
> > >
> > > Because we are dealing with an estimation, I don't think we can assert that
> > > the counters are zero, and I don't think reading only up to the cached size
> > > gives us any benefit:
> > > If the estimation is lower than actual available data, we are just deferring
> > > sending of the remaining data to a later stage.
> > Since we'll introduce a cached size, having the read() only happen up to
> > the size reported still makes sense to me.
> >
> > We're not deferring that much to later: when dirty data reaches zero, we'll
> > re-sync everything, including VFIO's VFIO_MIG_GET_PRECOPY_INFO. So it's
> > just splitting one last-phase read() into two smaller read()s. To me, it
> > still sounds OK if, with that, we can make sure the counter won't overflow.
>
> The counter doesn't overflow today as well (thanks to the MIN calculation).
>
> >
> > > If the estimation is higher than actual available data, we may still read()
> > > zero when the cached values are not zero.
> > >
> > > I think we should keep the code as is.
> > >
> > > Does that make sense?
> > I can understand what got reported in VFIO_MIG_GET_PRECOPY_INFO may not be
> > the total size of dirty data, but what the userapp can read. That part is
> > fine.
> >
> > Now, do you mean the size reported could shrink as well?
>
> Yes.
>
> > Could you explain
> > why, and when, dirtied data size can shrink?
>
> First of all, the sizes of VFIO_MIG_GET_PRECOPY_INFO are defined as an
> estimation, so anything can be reported there.
>
> But specifically for mlx5, VFIO_MIG_GET_PRECOPY_INFO involves two steps,
> query and save: first, the driver queries the device for the *expected*
> amount of data (that's the returned init/dirty sizes), and then the driver
> asynchronously saves the data.
> The query returns the *expected* amount of data, but the actual data
> returned in the save may be smaller.
>
> Regarding VFIO_DEVICE_FEATURE_MIG_DATA_SIZE, I don't have a concrete
> example of data "shrink".
> But I can think of a device whose data is all precopy-able, where
> VFIO_DEVICE_FEATURE_MIG_DATA_SIZE may report X bytes remaining while the
> actual data is smaller than X, simply because the uAPI defines it as an
> estimate.
>
> What I am trying to say is that the VFIO_MIG_GET_PRECOPY_INFO and
> VFIO_DEVICE_FEATURE_MIG_DATA_SIZE sizes are estimates which are good
> enough for telling QEMU how much data is remaining, but shouldn't be used
> to make precise calculations of how much to read(). At least I don't see
> the benefit of it.
If it can shrink too, then yes, there's no point.
I think what I'll do is assume that both ioctls can report anything
(including garbage, which sadly also fits the API definition), then make
sure the code keeps working and makes the best use of the estimates based
on them.
Thanks for the answers.
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (2 preceding siblings ...)
2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-25 14:15 ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
` (7 subsequent siblings)
11 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
Add a field to cache the stop-copy size. Note that there's an initial value
change in vfio_save_setup for the stop size default, but it shouldn't matter
as long as it is followed by a MIN() against VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/vfio/vfio-migration-internal.h | 1 +
hw/vfio/migration.c | 43 +++++++++++++++++--------------
2 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index 814fbd9eba..08df32c055 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -47,6 +47,7 @@ typedef struct VFIOMigration {
uint64_t mig_flags;
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
+ uint64_t stopcopy_size;
bool multifd_transfer;
VFIOMultifd *multifd;
bool initial_data_sent;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 851ea783f3..827d3ded63 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -41,6 +41,12 @@
*/
#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+/*
+ * Migration size of VFIO devices can be as little as a few KBs or as big as
+ * many GBs. This value should be big enough to cover the worst case.
+ */
+#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
+
static unsigned long bytes_transferred;
static const char *mig_state_to_str(enum vfio_device_mig_state state)
@@ -314,8 +320,7 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
migration->data_fd = -1;
}
-static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
- uint64_t *stop_copy_size)
+static int vfio_query_stop_copy_size(VFIODevice *vbasedev)
{
uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
sizeof(struct vfio_device_feature_mig_data_size),
@@ -323,16 +328,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
struct vfio_device_feature_mig_data_size *mig_data_size =
(struct vfio_device_feature_mig_data_size *)feature->data;
+ VFIOMigration *migration = vbasedev->migration;
feature->argsz = sizeof(buf);
feature->flags =
VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+ /*
+ * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE
+ * is reported so downtime limit won't be violated.
+ */
+ migration->stopcopy_size = VFIO_MIG_STOP_COPY_SIZE;
return -errno;
}
- *stop_copy_size = mig_data_size->stop_copy_length;
+ migration->stopcopy_size = mig_data_size->stop_copy_length;
return 0;
}
@@ -415,6 +426,9 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
return;
}
+ /* The total size remaining requires separate accounting */
+ migration->stopcopy_size -= data_size;
+
if (migration->precopy_init_size) {
uint64_t init_size = MIN(migration->precopy_init_size, data_size);
@@ -469,7 +483,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
int ret;
if (!vfio_multifd_setup(vbasedev, false, errp)) {
@@ -478,9 +491,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
- vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
+ vfio_query_stop_copy_size(vbasedev);
migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
- stop_copy_size);
+ migration->stopcopy_size);
migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
if (!migration->data_buffer) {
error_setg(errp, "%s: Failed to allocate migration data buffer",
@@ -576,32 +589,22 @@ static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
migration->precopy_dirty_size);
}
-/*
- * Migration size of VFIO devices can be as little as a few KBs or as big as
- * many GBs. This value should be big enough to cover the worst case.
- */
-#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
-
static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
uint64_t *can_postcopy)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
- /*
- * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
- * reported so downtime limit won't be violated.
- */
- vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
- *must_precopy += stop_copy_size;
+ vfio_query_stop_copy_size(vbasedev);
+ *must_precopy += migration->stopcopy_size;
if (vfio_device_state_is_precopy(vbasedev)) {
vfio_query_precopy_size(migration);
}
trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
- stop_copy_size, migration->precopy_init_size,
+ migration->stopcopy_size,
+ migration->precopy_init_size,
migration->precopy_dirty_size);
}
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration
2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
@ 2026-03-25 14:15 ` Avihai Horon
2026-04-01 20:41 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-03-25 14:15 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On 3/20/2026 1:12 AM, Peter Xu wrote:
>
>
> Add a field to cache the stop-copy size. Note that there's an initial value
> change in vfio_save_setup for the stop size default, but it shouldn't matter
> as long as it is followed by a MIN() against VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/vfio/vfio-migration-internal.h | 1 +
> hw/vfio/migration.c | 43 +++++++++++++++++--------------
> 2 files changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
> index 814fbd9eba..08df32c055 100644
> --- a/hw/vfio/vfio-migration-internal.h
> +++ b/hw/vfio/vfio-migration-internal.h
> @@ -47,6 +47,7 @@ typedef struct VFIOMigration {
> uint64_t mig_flags;
> uint64_t precopy_init_size;
> uint64_t precopy_dirty_size;
> + uint64_t stopcopy_size;
> bool multifd_transfer;
> VFIOMultifd *multifd;
> bool initial_data_sent;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 851ea783f3..827d3ded63 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -41,6 +41,12 @@
> */
> #define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
>
> +/*
> + * Migration size of VFIO devices can be as little as a few KBs or as big as
> + * many GBs. This value should be big enough to cover the worst case.
> + */
> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> +
> static unsigned long bytes_transferred;
>
> static const char *mig_state_to_str(enum vfio_device_mig_state state)
> @@ -314,8 +320,7 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
> migration->data_fd = -1;
> }
>
> -static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
> - uint64_t *stop_copy_size)
> +static int vfio_query_stop_copy_size(VFIODevice *vbasedev)
> {
> uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> sizeof(struct vfio_device_feature_mig_data_size),
> @@ -323,16 +328,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
> struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
> struct vfio_device_feature_mig_data_size *mig_data_size =
> (struct vfio_device_feature_mig_data_size *)feature->data;
> + VFIOMigration *migration = vbasedev->migration;
>
> feature->argsz = sizeof(buf);
> feature->flags =
> VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
>
> if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
> + /*
> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE
> + * is reported so downtime limit won't be violated.
> + */
> + migration->stopcopy_size = VFIO_MIG_STOP_COPY_SIZE;
> return -errno;
> }
>
> - *stop_copy_size = mig_data_size->stop_copy_length;
> + migration->stopcopy_size = mig_data_size->stop_copy_length;
>
> return 0;
> }
> @@ -415,6 +426,9 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> return;
> }
>
> + /* The total size remaining requires separate accounting */
> + migration->stopcopy_size -= data_size;
stopcopy_size is also an estimation, so I think it's safer to have:
migration->stopcopy_size -= MIN(migration->stopcopy_size, data_size);
Thanks.
> +
> if (migration->precopy_init_size) {
> uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>
> @@ -469,7 +483,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> - uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
> int ret;
>
> if (!vfio_multifd_setup(vbasedev, false, errp)) {
> @@ -478,9 +491,9 @@ static int vfio_save_setup(QEMUFile *f, void *opaque, Error **errp)
>
> qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>
> - vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> + vfio_query_stop_copy_size(vbasedev);
> migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
> - stop_copy_size);
> + migration->stopcopy_size);
> migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
> if (!migration->data_buffer) {
> error_setg(errp, "%s: Failed to allocate migration data buffer",
> @@ -576,32 +589,22 @@ static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> migration->precopy_dirty_size);
> }
>
> -/*
> - * Migration size of VFIO devices can be as little as a few KBs or as big as
> - * many GBs. This value should be big enough to cover the worst case.
> - */
> -#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> -
> static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> - uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>
> - /*
> - * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> - * reported so downtime limit won't be violated.
> - */
> - vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> - *must_precopy += stop_copy_size;
> + vfio_query_stop_copy_size(vbasedev);
> + *must_precopy += migration->stopcopy_size;
>
> if (vfio_device_state_is_precopy(vbasedev)) {
> vfio_query_precopy_size(migration);
> }
>
> trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> - stop_copy_size, migration->precopy_init_size,
> + migration->stopcopy_size,
> + migration->precopy_init_size,
> migration->precopy_dirty_size);
> }
>
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration
2026-03-25 14:15 ` Avihai Horon
@ 2026-04-01 20:41 ` Peter Xu
2026-04-06 11:28 ` Avihai Horon
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-01 20:41 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Wed, Mar 25, 2026 at 04:15:45PM +0200, Avihai Horon wrote:
> > @@ -415,6 +426,9 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> > return;
> > }
> >
> > + /* The total size remaining requires separate accounting */
> > + migration->stopcopy_size -= data_size;
>
> stopcopy_size is also an estimation, so I think it's safer to have:
>
> migration->stopcopy_size -= MIN(migration->stopcopy_size, data_size);
I believe this is very relevant to the question in the previous patch, on
whether the reported value can shrink. In this case, it's about
VFIO_DEVICE_FEATURE_MIG_DATA_SIZE instead. We can discuss both ioctls there.
I hope it's clear what I wanted to do with this series, which is to have
counters cached in QEMU so as to align the estimate/exact reports. The
counters should (I hope) be 100% accurate in reflecting the current status
of the VFIO device, if not necessarily the latest.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration
2026-04-01 20:41 ` Peter Xu
@ 2026-04-06 11:28 ` Avihai Horon
0 siblings, 0 replies; 67+ messages in thread
From: Avihai Horon @ 2026-04-06 11:28 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On 4/1/2026 11:41 PM, Peter Xu wrote:
>
>
> On Wed, Mar 25, 2026 at 04:15:45PM +0200, Avihai Horon wrote:
>>> @@ -415,6 +426,9 @@ static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>>> return;
>>> }
>>>
>>> + /* The total size remaining requires separate accounting */
>>> + migration->stopcopy_size -= data_size;
>> stopcopy_size is also an estimation, so I think it's safer to have:
>>
>> migration->stopcopy_size -= MIN(migration->stopcopy_size, data_size);
> I believe this is very relevant to the question in the previous patch, on
> whether the reported value can shrink. In this case, it's about
> VFIO_DEVICE_FEATURE_MIG_DATA_SIZE instead. We can discuss both ioctls there.
>
> I hope it's clear what I wanted to do with this series, which is to have
> counters cached in QEMU so as to align the estimate/exact reports. The
> counters should (I hope) be 100% accurate in reflecting the current status
> of the VFIO device, if not necessarily the latest.
Yes, I understand, but the counters are not 100% accurate (even for the
current status), both as defined by the uAPI and in reality. However,
they are still good enough for the estimate/exact reports.
VFIO devices are a bit different from RAM, where everything is very
precise: you can know the exact remaining size, and the remaining size
can only grow.
Thanks.
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (3 preceding siblings ...)
2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-24 10:35 ` Prasad Pandit
` (2 more replies)
2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
` (6 subsequent siblings)
11 siblings, 3 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Halil Pasic, Christian Borntraeger, Jason Herne, Eric Farman,
Matthew Rosato, Richard Henderson, Ilya Leoshkevich,
David Hildenbrand, Cornelia Huck, Eric Blake,
Vladimir Sementsov-Ogievskiy, John Snow
These two APIs largely duplicate each other. For example, a few users
directly pass in the same function for both.
Providing two hooks is also slightly error prone, in that it makes it
easier for a module to report different things via the two hooks. In
reality they should always report the same thing; the only difference is
whether we should use a fast path when the slow path might be too slow,
even if we need to pay with some loss of accuracy.
Let's just merge them into one API, and instead provide a bool showing
whether the query is a fast query or not.
No functional change intended.
Export qemu_savevm_query_pending(), so that new users can directly use
the new API to do the query. This will happen very soon.
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Jason Herne <jjherne@linux.ibm.com>
Cc: Eric Farman <farman@linux.ibm.com>
Cc: Matthew Rosato <mjrosato@linux.ibm.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Cc: John Snow <jsnow@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
docs/devel/migration/main.rst | 5 ++-
docs/devel/migration/vfio.rst | 9 ++----
include/migration/register.h | 52 ++++++++++--------------------
migration/savevm.h | 3 ++
hw/s390x/s390-stattrib.c | 8 ++---
hw/vfio/migration.c | 58 +++++++++++++++++++---------------
migration/block-dirty-bitmap.c | 9 ++----
migration/ram.c | 32 ++++++-------------
migration/savevm.c | 43 ++++++++++++-------------
hw/vfio/trace-events | 3 +-
10 files changed, 93 insertions(+), 129 deletions(-)
diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
index 234d280249..22c5910d5c 100644
--- a/docs/devel/migration/main.rst
+++ b/docs/devel/migration/main.rst
@@ -519,9 +519,8 @@ An iterative device must provide:
data we must save. The core migration code will use this to
determine when to pause the CPUs and complete the migration.
- - A ``state_pending_estimate`` function that indicates how much more
- data we must save. When the estimated amount is smaller than the
- threshold, we call ``state_pending_exact``.
+ - A ``save_query_pending`` function that indicates how much more
+ data we must save.
- A ``save_live_iterate`` function should send a chunk of data until
the point that stream bandwidth limits tell it to stop. Each call
diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index 0790e5031d..33768c877c 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``load_setup`` function that sets the VFIO device on the destination in
_RESUMING state.
-* A ``state_pending_estimate`` function that reports an estimate of the
- remaining pre-copy data that the vendor driver has yet to save for the VFIO
- device.
-
-* A ``state_pending_exact`` function that reads pending_bytes from the vendor
- driver, which indicates the amount of data that the vendor driver has yet to
- save for the VFIO device.
+* A ``save_query_pending`` function that reports the remaining pre-copy
+ data that the vendor driver has yet to save for the VFIO device.
* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
active only when the VFIO device is in pre-copy states.
diff --git a/include/migration/register.h b/include/migration/register.h
index d0f37f5f43..2320c3a981 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -16,6 +16,15 @@
#include "hw/core/vmstate-if.h"
+typedef struct MigPendingData {
+ /* How many bytes are pending for precopy / stopcopy? */
+ uint64_t precopy_bytes;
+ /* How many bytes are pending that can be transferred in postcopy? */
+ uint64_t postcopy_bytes;
+ /* Is this a fastpath query (which can be inaccurate)? */
+ bool fastpath;
+} MigPendingData;
+
/**
* struct SaveVMHandlers: handler structure to finely control
* migration of complex subsystems and devices, such as RAM, block and
@@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
/**
- * @state_pending_estimate
- *
- * This estimates the remaining data to transfer
- *
- * Sum of @can_postcopy and @must_postcopy is the whole amount of
- * pending data.
- *
- * @opaque: data pointer passed to register_savevm_live()
- * @must_precopy: amount of data that must be migrated in precopy
- * or in stopped state, i.e. that must be migrated
- * before target start.
- * @can_postcopy: amount of data that can be migrated in postcopy
- * or in stopped state, i.e. after target start.
- * Some can also be migrated during precopy (RAM).
- * Some must be migrated after source stops
- * (block-dirty-bitmap)
- */
- void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy);
-
- /**
- * @state_pending_exact
- *
- * This calculates the exact remaining data to transfer
+ * @save_query_pending
*
- * Sum of @can_postcopy and @must_postcopy is the whole amount of
- * pending data.
+ * This estimates the remaining data to transfer on the source side.
+ * It's highly suggested that the module implement both fastpath
+ * and slowpath versions of it when it can be slow (for more information
+ * please check the pending->fastpath field).
*
* @opaque: data pointer passed to register_savevm_live()
- * @must_precopy: amount of data that must be migrated in precopy
- * or in stopped state, i.e. that must be migrated
- * before target start.
- * @can_postcopy: amount of data that can be migrated in postcopy
- * or in stopped state, i.e. after target start.
- * Some can also be migrated during precopy (RAM).
- * Some must be migrated after source stops
- * (block-dirty-bitmap)
+ * @pending: pointer to a MigPendingData struct
*/
- void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy);
+ void (*save_query_pending)(void *opaque, MigPendingData *pending);
/**
* @load_state
diff --git a/migration/savevm.h b/migration/savevm.h
index b3d1e8a13c..b116933bce 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -14,6 +14,8 @@
#ifndef MIGRATION_SAVEVM_H
#define MIGRATION_SAVEVM_H
+#include "migration/register.h"
+
#define QEMU_VM_FILE_MAGIC 0x5145564d
#define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
#define QEMU_VM_FILE_VERSION 0x00000003
@@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
void qemu_savevm_state_cleanup(void);
void qemu_savevm_state_complete_postcopy(QEMUFile *f);
int qemu_savevm_state_complete_precopy(MigrationState *s);
+void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
uint64_t *can_postcopy);
void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
index d808ece3b9..b1ec51c77a 100644
--- a/hw/s390x/s390-stattrib.c
+++ b/hw/s390x/s390-stattrib.c
@@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
return 0;
}
-static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void cmma_state_pending(void *opaque, MigPendingData *pending)
{
S390StAttribState *sas = S390_STATTRIB(opaque);
S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
long long res = sac->get_dirtycount(sas);
if (res >= 0) {
- *must_precopy += res;
+ pending->precopy_bytes += res;
}
}
@@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
.save_setup = cmma_save_setup,
.save_live_iterate = cmma_save_iterate,
.save_complete = cmma_save_complete,
- .state_pending_exact = cmma_state_pending,
- .state_pending_estimate = cmma_state_pending,
+ .save_query_pending = cmma_state_pending,
.save_cleanup = cmma_save_cleanup,
.load_state = cmma_load,
.is_active = cmma_active,
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 827d3ded63..c054c749b0 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
trace_vfio_save_cleanup(vbasedev->name);
}
-static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void vfio_state_pending_sync(VFIODevice *vbasedev)
{
- VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- if (!vfio_device_state_is_precopy(vbasedev)) {
- return;
- }
+ vfio_query_stop_copy_size(vbasedev);
- *must_precopy +=
- migration->precopy_init_size + migration->precopy_dirty_size;
+ if (vfio_device_state_is_precopy(vbasedev)) {
+ vfio_query_precopy_size(migration);
+ }
- trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
- *can_postcopy,
- migration->precopy_init_size,
- migration->precopy_dirty_size);
+ /*
+ * In all cases, the PRECOPY data should be no more than the STOPCOPY
+ * data; otherwise we have a problem. For now, only report an error.
+ */
+ if (migration->precopy_init_size + migration->precopy_dirty_size <
+ migration->stopcopy_size) {
+ error_report_once("%s: wrong pending data (init=%" PRIx64
+ ", dirty=%"PRIx64", stop=%"PRIx64")",
+ __func__, migration->precopy_init_size,
+ migration->precopy_dirty_size,
+ migration->stopcopy_size);
+ }
}
-static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void vfio_state_pending(void *opaque, MigPendingData *pending)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
+ uint64_t remain;
- vfio_query_stop_copy_size(vbasedev);
- *must_precopy += migration->stopcopy_size;
-
- if (vfio_device_state_is_precopy(vbasedev)) {
- vfio_query_precopy_size(migration);
+ if (pending->fastpath) {
+ if (!vfio_device_state_is_precopy(vbasedev)) {
+ return;
+ }
+ remain = migration->precopy_init_size + migration->precopy_dirty_size;
+ } else {
+ vfio_state_pending_sync(vbasedev);
+ remain = migration->stopcopy_size;
}
- trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
- migration->stopcopy_size,
- migration->precopy_init_size,
- migration->precopy_dirty_size);
+ pending->precopy_bytes += remain;
+
+ trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
+ migration->precopy_init_size,
+ migration->precopy_dirty_size);
}
static bool vfio_is_active_iterate(void *opaque)
@@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.save_prepare = vfio_save_prepare,
.save_setup = vfio_save_setup,
.save_cleanup = vfio_save_cleanup,
- .state_pending_estimate = vfio_state_pending_estimate,
- .state_pending_exact = vfio_state_pending_exact,
+ .save_query_pending = vfio_state_pending,
.is_active_iterate = vfio_is_active_iterate,
.save_live_iterate = vfio_save_iterate,
.save_complete = vfio_save_complete_precopy,
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index a061aad817..376a9b43ac 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
return 0;
}
-static void dirty_bitmap_state_pending(void *opaque,
- uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
{
DBMSaveState *s = &((DBMState *)opaque)->save;
SaveBitmapState *dbms;
@@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
trace_dirty_bitmap_state_pending(pending);
- *can_postcopy += pending;
+ data->postcopy_bytes += pending;
}
/* First occurrence of this bitmap. It should be created if doesn't exist */
@@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
.save_setup = dirty_bitmap_save_setup,
.save_complete = dirty_bitmap_save_complete,
.has_postcopy = dirty_bitmap_has_postcopy,
- .state_pending_exact = dirty_bitmap_state_pending,
- .state_pending_estimate = dirty_bitmap_state_pending,
+ .save_query_pending = dirty_bitmap_state_pending,
.save_live_iterate = dirty_bitmap_save_iterate,
.is_active_iterate = dirty_bitmap_is_active_iterate,
.load_state = dirty_bitmap_load,
diff --git a/migration/ram.c b/migration/ram.c
index 979751f61b..89f761a471 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
return qemu_fflush(f);
}
-static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
-{
- RAMState **temp = opaque;
- RAMState *rs = *temp;
-
- uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
-
- if (migrate_postcopy_ram()) {
- /* We can do postcopy, and all the data is postcopiable */
- *can_postcopy += remaining_size;
- } else {
- *must_precopy += remaining_size;
- }
-}
-
-static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void ram_state_pending(void *opaque, MigPendingData *pending)
{
RAMState **temp = opaque;
RAMState *rs = *temp;
uint64_t remaining_size;
- if (!migration_in_postcopy()) {
+ /*
+ * Sync is only needed either with: (1) a fast query, or (2) postcopy
+ * as started (in which case no new dirty will generate anymore).
+ */
+ if (!pending->fastpath && !migration_in_postcopy()) {
bql_lock();
WITH_RCU_READ_LOCK_GUARD() {
migration_bitmap_sync_precopy(false);
@@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
if (migrate_postcopy_ram()) {
/* We can do postcopy, and all the data is postcopiable */
- *can_postcopy += remaining_size;
+ pending->postcopy_bytes += remaining_size;
} else {
- *must_precopy += remaining_size;
+ pending->precopy_bytes += remaining_size;
}
}
@@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
.save_live_iterate = ram_save_iterate,
.save_complete = ram_save_complete,
.has_postcopy = ram_has_postcopy,
- .state_pending_exact = ram_state_pending_exact,
- .state_pending_estimate = ram_state_pending_estimate,
+ .save_query_pending = ram_state_pending,
.load_state = ram_load,
.save_cleanup = ram_save_cleanup,
.load_setup = ram_load_setup,
diff --git a/migration/savevm.c b/migration/savevm.c
index dd58f2a705..6268e68382 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
return qemu_fflush(f);
}
-/* Give an estimate of the amount left to be transferred,
- * the result is split into the amount for units that can and
- * for units that can't do postcopy.
- */
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
- uint64_t *can_postcopy)
+void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
{
SaveStateEntry *se;
- *must_precopy = 0;
- *can_postcopy = 0;
+ pending->precopy_bytes = 0;
+ pending->postcopy_bytes = 0;
+ pending->fastpath = fastpath;
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
- if (!se->ops || !se->ops->state_pending_estimate) {
+ if (!se->ops || !se->ops->save_query_pending) {
continue;
}
if (!qemu_savevm_state_active(se)) {
continue;
}
- se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
+ se->ops->save_query_pending(se->opaque, pending);
}
}
+void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
+ uint64_t *can_postcopy)
+{
+ MigPendingData pending;
+
+ qemu_savevm_query_pending(&pending, true);
+
+ *must_precopy = pending.precopy_bytes;
+ *can_postcopy = pending.postcopy_bytes;
+}
+
void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
uint64_t *can_postcopy)
{
- SaveStateEntry *se;
+ MigPendingData pending;
- *must_precopy = 0;
- *can_postcopy = 0;
+ qemu_savevm_query_pending(&pending, false);
- QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
- if (!se->ops || !se->ops->state_pending_exact) {
- continue;
- }
- if (!qemu_savevm_state_active(se)) {
- continue;
- }
- se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
- }
+ *must_precopy = pending.precopy_bytes;
+ *can_postcopy = pending.postcopy_bytes;
}
void qemu_savevm_state_cleanup(void)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 846e3625c5..7cf5a9eb2d 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
vfio_save_iterate_start(const char *name) " (%s)"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
-vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
+vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
@ 2026-03-24 10:35 ` Prasad Pandit
2026-04-01 20:53 ` Peter Xu
2026-03-25 15:20 ` Avihai Horon
2026-03-27 15:17 ` Juraj Marcin
2 siblings, 1 reply; 67+ messages in thread
From: Prasad Pandit @ 2026-03-24 10:35 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Halil Pasic, Christian Borntraeger, Jason Herne, Eric Farman,
Matthew Rosato, Richard Henderson, Ilya Leoshkevich,
David Hildenbrand, Cornelia Huck, Eric Blake,
Vladimir Sementsov-Ogievskiy, John Snow
On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> These two APIs are a slight duplication. For example, there are a few
> users that directly pass the same function to both.
>
> It might also be slightly error prone to provide two hooks, as it makes
> it easier for one module to report different things via them. In reality
> they should always report the same thing; the only question is whether to
> use a fast path when the slow path might be too slow, even if that costs
> some accuracy.
>
> Let's just merge it into one API, but instead provide a bool showing if the
> query is a fast query or not.
>
> No functional change intended.
>
> Export qemu_savevm_query_pending(). New users of the query should
> likely use this new API directly. This will happen very soon.
>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Jason Herne <jjherne@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Cc: John Snow <jsnow@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/devel/migration/main.rst | 5 ++-
> docs/devel/migration/vfio.rst | 9 ++----
> include/migration/register.h | 52 ++++++++++--------------------
> migration/savevm.h | 3 ++
> hw/s390x/s390-stattrib.c | 8 ++---
> hw/vfio/migration.c | 58 +++++++++++++++++++---------------
> migration/block-dirty-bitmap.c | 9 ++----
> migration/ram.c | 32 ++++++-------------
> migration/savevm.c | 43 ++++++++++++-------------
> hw/vfio/trace-events | 3 +-
> 10 files changed, 93 insertions(+), 129 deletions(-)
>
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 234d280249..22c5910d5c 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -519,9 +519,8 @@ An iterative device must provide:
> data we must save. The core migration code will use this to
> determine when to pause the CPUs and complete the migration.
>
> - - A ``state_pending_estimate`` function that indicates how much more
> - data we must save. When the estimated amount is smaller than the
> - threshold, we call ``state_pending_exact``.
> + - A ``save_query_pending`` function that indicates how much more
> + data we must save.
>
> - A ``save_live_iterate`` function should send a chunk of data until
> the point that stream bandwidth limits tell it to stop. Each call
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index 0790e5031d..33768c877c 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> * A ``load_setup`` function that sets the VFIO device on the destination in
> _RESUMING state.
>
> -* A ``state_pending_estimate`` function that reports an estimate of the
> - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> - device.
> -
> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> - driver, which indicates the amount of data that the vendor driver has yet to
> - save for the VFIO device.
> +* A ``save_query_pending`` function that reports the remaining pre-copy
> + data that the vendor driver has yet to save for the VFIO device.
>
> * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> active only when the VFIO device is in pre-copy states.
> diff --git a/include/migration/register.h b/include/migration/register.h
> index d0f37f5f43..2320c3a981 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -16,6 +16,15 @@
>
> #include "hw/core/vmstate-if.h"
>
> +typedef struct MigPendingData {
> + /* How many bytes are pending for precopy / stopcopy? */
> + uint64_t precopy_bytes;
> + /* How many bytes are pending that can be transferred in postcopy? */
> + uint64_t postcopy_bytes;
> + /* Is this a fastpath query (which can be inaccurate)? */
> + bool fastpath;
> +} MigPendingData;
> +
* Pending precopy bytes are: bytes still pending to be sent + dirty
(updated) bytes; How do we measure/estimate pending postcopy bytes?
* Could we do away with separating pending bytes into Precopy and
Postcopy bytes? Instead go with just pending_bytes on the source side?
* Let's not add another term/variable called stopcopy. It'll only add
to the confusion.
> /**
> * struct SaveVMHandlers: handler structure to finely control
> * migration of complex subsystems and devices, such as RAM, block and
> @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
> bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>
> /**
> - * @state_pending_estimate
> - *
> - * This estimates the remaining data to transfer
> - *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> - *
> - * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> - */
> - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> -
> - /**
> - * @state_pending_exact
> - *
> - * This calculates the exact remaining data to transfer
> + * @save_query_pending
> *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> + * This estimates the remaining data to transfer on the source side.
> + * It is highly suggested that the module implement both fastpath and
> + * slowpath versions when the query can be slow (for more information,
> + * please check the pending->fastpath field).
> *
> * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> + * @pending: pointer to a MigPendingData struct
> */
> - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> + void (*save_query_pending)(void *opaque, MigPendingData *pending);
>
> /**
> * @load_state
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b3d1e8a13c..b116933bce 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -14,6 +14,8 @@
> #ifndef MIGRATION_SAVEVM_H
> #define MIGRATION_SAVEVM_H
>
> +#include "migration/register.h"
> +
> #define QEMU_VM_FILE_MAGIC 0x5145564d
> #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> #define QEMU_VM_FILE_VERSION 0x00000003
> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> void qemu_savevm_state_cleanup(void);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy);
> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..b1ec51c77a 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> return 0;
> }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending)
> {
> S390StAttribState *sas = S390_STATTRIB(opaque);
> S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> long long res = sac->get_dirtycount(sas);
>
> if (res >= 0) {
> - *must_precopy += res;
> + pending->precopy_bytes += res;
> }
> }
>
> @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> .save_setup = cmma_save_setup,
> .save_live_iterate = cmma_save_iterate,
> .save_complete = cmma_save_complete,
> - .state_pending_exact = cmma_state_pending,
> - .state_pending_estimate = cmma_state_pending,
> + .save_query_pending = cmma_state_pending,
> .save_cleanup = cmma_save_cleanup,
> .load_state = cmma_load,
> .is_active = cmma_active,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 827d3ded63..c054c749b0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
> trace_vfio_save_cleanup(vbasedev->name);
> }
>
> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> {
> - VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
>
> - if (!vfio_device_state_is_precopy(vbasedev)) {
> - return;
> - }
> + vfio_query_stop_copy_size(vbasedev);
>
> - *must_precopy +=
> - migration->precopy_init_size + migration->precopy_dirty_size;
> + if (vfio_device_state_is_precopy(vbasedev)) {
> + vfio_query_precopy_size(migration);
> + }
>
> - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> - *can_postcopy,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + /*
> + * In all cases, the PRECOPY data should be no more than the STOPCOPY
> + * data; otherwise we have a problem. For now, only report an error.
> + */
> + if (migration->precopy_init_size + migration->precopy_dirty_size <
> + migration->stopcopy_size) {
> + error_report_once("%s: wrong pending data (init=%" PRIx64
> + ", dirty=%"PRIx64", stop=%"PRIx64")",
> + __func__, migration->precopy_init_size,
> + migration->precopy_dirty_size,
> + migration->stopcopy_size);
> + }
> }
>
> -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending(void *opaque, MigPendingData *pending)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> + uint64_t remain;
>
> - vfio_query_stop_copy_size(vbasedev);
> - *must_precopy += migration->stopcopy_size;
> -
> - if (vfio_device_state_is_precopy(vbasedev)) {
> - vfio_query_precopy_size(migration);
> + if (pending->fastpath) {
> + if (!vfio_device_state_is_precopy(vbasedev)) {
> + return;
> + }
> + remain = migration->precopy_init_size + migration->precopy_dirty_size;
> + } else {
> + vfio_state_pending_sync(vbasedev);
> + remain = migration->stopcopy_size;
> }
>
> - trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> - migration->stopcopy_size,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + pending->precopy_bytes += remain;
> +
> + trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> + migration->precopy_init_size,
> + migration->precopy_dirty_size);
> }
>
> static bool vfio_is_active_iterate(void *opaque)
> @@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .save_prepare = vfio_save_prepare,
> .save_setup = vfio_save_setup,
> .save_cleanup = vfio_save_cleanup,
> - .state_pending_estimate = vfio_state_pending_estimate,
> - .state_pending_exact = vfio_state_pending_exact,
> + .save_query_pending = vfio_state_pending,
> .is_active_iterate = vfio_is_active_iterate,
> .save_live_iterate = vfio_save_iterate,
> .save_complete = vfio_save_complete_precopy,
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index a061aad817..376a9b43ac 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> return 0;
> }
>
> -static void dirty_bitmap_state_pending(void *opaque,
> - uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
> {
> DBMSaveState *s = &((DBMState *)opaque)->save;
> SaveBitmapState *dbms;
> @@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
>
> trace_dirty_bitmap_state_pending(pending);
>
> - *can_postcopy += pending;
> + data->postcopy_bytes += pending;
> }
>
> /* First occurrence of this bitmap. It should be created if doesn't exist */
> @@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
> .save_setup = dirty_bitmap_save_setup,
> .save_complete = dirty_bitmap_save_complete,
> .has_postcopy = dirty_bitmap_has_postcopy,
> - .state_pending_exact = dirty_bitmap_state_pending,
> - .state_pending_estimate = dirty_bitmap_state_pending,
> + .save_query_pending = dirty_bitmap_state_pending,
> .save_live_iterate = dirty_bitmap_save_iterate,
> .is_active_iterate = dirty_bitmap_is_active_iterate,
> .load_state = dirty_bitmap_load,
> diff --git a/migration/ram.c b/migration/ram.c
> index 979751f61b..89f761a471 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> return qemu_fflush(f);
> }
>
> -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - RAMState **temp = opaque;
> - RAMState *rs = *temp;
> -
> - uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -
> - if (migrate_postcopy_ram()) {
> - /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> - } else {
> - *must_precopy += remaining_size;
> - }
> -}
> -
> -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void ram_state_pending(void *opaque, MigPendingData *pending)
> {
> RAMState **temp = opaque;
> RAMState *rs = *temp;
> uint64_t remaining_size;
>
> - if (!migration_in_postcopy()) {
> + /*
> + * Sync is only needed either with: (1) a fast query, or (2) postcopy
> + * as started (in which case no new dirty will generate anymore).
* as -> has
* (... no new dirty will generate anymore) - ...?
> + */
> + if (!pending->fastpath && !migration_in_postcopy()) {
* The comment above and the 'if' condition don't seem to match. The
comment says a call to 'migration_bitmap_sync_precopy()' is needed
either with a fast query, i.e. 'pending->fastpath == true', OR when
postcopy has started, i.e. migration_in_postcopy() == true, but the
code calls it only when neither is true. What is the right/intended
behaviour here?
> bql_lock();
> WITH_RCU_READ_LOCK_GUARD() {
> migration_bitmap_sync_precopy(false);
> @@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
>
> if (migrate_postcopy_ram()) {
> /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> + pending->postcopy_bytes += remaining_size;
> } else {
> - *must_precopy += remaining_size;
> + pending->precopy_bytes += remaining_size;
> }
> }
>
> @@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
> .save_live_iterate = ram_save_iterate,
> .save_complete = ram_save_complete,
> .has_postcopy = ram_has_postcopy,
> - .state_pending_exact = ram_state_pending_exact,
> - .state_pending_estimate = ram_state_pending_estimate,
> + .save_query_pending = ram_state_pending,
> .load_state = ram_load,
> .save_cleanup = ram_save_cleanup,
> .load_setup = ram_load_setup,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index dd58f2a705..6268e68382 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> return qemu_fflush(f);
> }
>
> -/* Give an estimate of the amount left to be transferred,
> - * the result is split into the amount for units that can and
> - * for units that can't do postcopy.
> - */
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> {
> SaveStateEntry *se;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + pending->precopy_bytes = 0;
> + pending->postcopy_bytes = 0;
> + pending->fastpath = fastpath;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_estimate) {
> + if (!se->ops || !se->ops->save_query_pending) {
> continue;
> }
> if (!qemu_savevm_state_active(se)) {
> continue;
> }
> - se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> + se->ops->save_query_pending(se->opaque, pending);
> }
> }
>
> +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> + uint64_t *can_postcopy)
> +{
> + MigPendingData pending;
> +
> + qemu_savevm_query_pending(&pending, true);
> +
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> +}
> +
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> - SaveStateEntry *se;
> + MigPendingData pending;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + qemu_savevm_query_pending(&pending, false);
>
> - QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_exact) {
> - continue;
> - }
> - if (!qemu_savevm_state_active(se)) {
> - continue;
> - }
> - se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> - }
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> }
* IIUC, we pass fastpath=true from _pending_estimate() because we want
to avoid the call to migration_bitmap_sync_precopy(), as calling it
won't be fast. And we pass fastpath=false from _pending_exact() because
we want to trigger the call to the migration_bitmap_sync_precopy()
routine.
* Instead of calling the boolean parameter 'fastpath', let's use
'bool use_precision' OR 'bool sync_data'. And then in
qemu_savevm_query_pending():
    if (use_precision) {    /* or: if (sync_data) */
        migration_bitmap_sync_precopy()
    }
* When reading code, equating fastpath with 'estimation' and slowpath
with 'precision/accuracy' is unnatural.
Thank you.
---
- Prasad
^ permalink raw reply [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-03-24 10:35 ` Prasad Pandit
@ 2026-04-01 20:53 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-01 20:53 UTC (permalink / raw)
To: Prasad Pandit
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Halil Pasic, Christian Borntraeger, Jason Herne, Eric Farman,
Matthew Rosato, Richard Henderson, Ilya Leoshkevich,
David Hildenbrand, Cornelia Huck, Eric Blake,
Vladimir Sementsov-Ogievskiy, John Snow
On Tue, Mar 24, 2026 at 04:05:00PM +0530, Prasad Pandit wrote:
> > +typedef struct MigPendingData {
> > + /* How many bytes are pending for precopy / stopcopy? */
> > + uint64_t precopy_bytes;
> > + /* How many bytes are pending that can be transferred in postcopy? */
> > + uint64_t postcopy_bytes;
> > + /* Is this a fastpath query (which can be inaccurate)? */
> > + bool fastpath;
> > +} MigPendingData ;
> > +
>
> * Pending precopy bytes are: bytes still pending to be sent + dirty
> (updated) bytes; how do we measure/estimate pending postcopy bytes?
>
> * Could we do away with separating pending bytes into Precopy and
> Postcopy bytes? Instead go with just pending_bytes on the source side?
>
> * Let's not add another term/variable called stopcopy. It'll only add
> to the confusion.
What would you suggest to solve this problem?
>
>
> > /**
> > * struct SaveVMHandlers: handler structure to finely control
> > * migration of complex subsystems and devices, such as RAM, block and
> > @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
> > bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
> >
> > /**
> > - * @state_pending_estimate
> > - *
> > - * This estimates the remaining data to transfer
> > - *
> > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > - * pending data.
> > - *
> > - * @opaque: data pointer passed to register_savevm_live()
> > - * @must_precopy: amount of data that must be migrated in precopy
> > - * or in stopped state, i.e. that must be migrated
> > - * before target start.
> > - * @can_postcopy: amount of data that can be migrated in postcopy
> > - * or in stopped state, i.e. after target start.
> > - * Some can also be migrated during precopy (RAM).
> > - * Some must be migrated after source stops
> > - * (block-dirty-bitmap)
> > - */
> > - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy);
> > -
> > - /**
> > - * @state_pending_exact
> > - *
> > - * This calculates the exact remaining data to transfer
> > + * @save_query_pending
> > *
> > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > - * pending data.
> > + * This estimates the remaining data to transfer on the source side.
> > + * It's highly suggested that the module should implement both fastpath
> > + * and slowpath version of it when it can be slow (for more information
> > + * please check pending->fastpath field).
> > *
> > * @opaque: data pointer passed to register_savevm_live()
> > - * @must_precopy: amount of data that must be migrated in precopy
> > - * or in stopped state, i.e. that must be migrated
> > - * before target start.
> > - * @can_postcopy: amount of data that can be migrated in postcopy
> > - * or in stopped state, i.e. after target start.
> > - * Some can also be migrated during precopy (RAM).
> > - * Some must be migrated after source stops
> > - * (block-dirty-bitmap)
> > + * @pending: pointer to a MigPendingData struct
> > */
> > - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy);
> > + void (*save_query_pending)(void *opaque, MigPendingData *pending);
> >
> > /**
> > * @load_state
> > diff --git a/migration/savevm.h b/migration/savevm.h
> > index b3d1e8a13c..b116933bce 100644
> > --- a/migration/savevm.h
> > +++ b/migration/savevm.h
> > @@ -14,6 +14,8 @@
> > #ifndef MIGRATION_SAVEVM_H
> > #define MIGRATION_SAVEVM_H
> >
> > +#include "migration/register.h"
> > +
> > #define QEMU_VM_FILE_MAGIC 0x5145564d
> > #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> > #define QEMU_VM_FILE_VERSION 0x00000003
> > @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> > void qemu_savevm_state_cleanup(void);
> > void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> > int qemu_savevm_state_complete_precopy(MigrationState *s);
> > +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> > void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> > uint64_t *can_postcopy);
> > void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> > diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> > index d808ece3b9..b1ec51c77a 100644
> > --- a/hw/s390x/s390-stattrib.c
> > +++ b/hw/s390x/s390-stattrib.c
> > @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> > return 0;
> > }
> >
> > -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void cmma_state_pending(void *opaque, MigPendingData *pending)
> > {
> > S390StAttribState *sas = S390_STATTRIB(opaque);
> > S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> > long long res = sac->get_dirtycount(sas);
> >
> > if (res >= 0) {
> > - *must_precopy += res;
> > + pending->precopy_bytes += res;
> > }
> > }
> >
> > @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> > .save_setup = cmma_save_setup,
> > .save_live_iterate = cmma_save_iterate,
> > .save_complete = cmma_save_complete,
> > - .state_pending_exact = cmma_state_pending,
> > - .state_pending_estimate = cmma_state_pending,
> > + .save_query_pending = cmma_state_pending,
> > .save_cleanup = cmma_save_cleanup,
> > .load_state = cmma_load,
> > .is_active = cmma_active,
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 827d3ded63..c054c749b0 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
> > trace_vfio_save_cleanup(vbasedev->name);
> > }
> >
> > -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> > {
> > - VFIODevice *vbasedev = opaque;
> > VFIOMigration *migration = vbasedev->migration;
> >
> > - if (!vfio_device_state_is_precopy(vbasedev)) {
> > - return;
> > - }
> > + vfio_query_stop_copy_size(vbasedev);
> >
> > - *must_precopy +=
> > - migration->precopy_init_size + migration->precopy_dirty_size;
> > + if (vfio_device_state_is_precopy(vbasedev)) {
> > + vfio_query_precopy_size(migration);
> > + }
> >
> > - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> > - *can_postcopy,
> > - migration->precopy_init_size,
> > - migration->precopy_dirty_size);
> > + /*
> > + * In all cases, all PRECOPY data should be no more than STOPCOPY data.
> > + * Otherwise we have a problem. So far, only dump some errors.
> > + */
> > + if (migration->precopy_init_size + migration->precopy_dirty_size <
> > + migration->stopcopy_size) {
> > + error_report_once("%s: wrong pending data (init=%" PRIx64
> > + ", dirty=%"PRIx64", stop=%"PRIx64")",
> > + __func__, migration->precopy_init_size,
> > + migration->precopy_dirty_size,
> > + migration->stopcopy_size);
> > + }
> > }
> >
> > -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void vfio_state_pending(void *opaque, MigPendingData *pending)
> > {
> > VFIODevice *vbasedev = opaque;
> > VFIOMigration *migration = vbasedev->migration;
> > + uint64_t remain;
> >
> > - vfio_query_stop_copy_size(vbasedev);
> > - *must_precopy += migration->stopcopy_size;
> > -
> > - if (vfio_device_state_is_precopy(vbasedev)) {
> > - vfio_query_precopy_size(migration);
> > + if (pending->fastpath) {
> > + if (!vfio_device_state_is_precopy(vbasedev)) {
> > + return;
> > + }
> > + remain = migration->precopy_init_size + migration->precopy_dirty_size;
> > + } else {
> > + vfio_state_pending_sync(vbasedev);
> > + remain = migration->stopcopy_size;
> > }
> >
> > - trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> > - migration->stopcopy_size,
> > - migration->precopy_init_size,
> > - migration->precopy_dirty_size);
> > + pending->precopy_bytes += remain;
> > +
> > + trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> > + migration->precopy_init_size,
> > + migration->precopy_dirty_size);
> > }
> >
> > static bool vfio_is_active_iterate(void *opaque)
> > @@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> > .save_prepare = vfio_save_prepare,
> > .save_setup = vfio_save_setup,
> > .save_cleanup = vfio_save_cleanup,
> > - .state_pending_estimate = vfio_state_pending_estimate,
> > - .state_pending_exact = vfio_state_pending_exact,
> > + .save_query_pending = vfio_state_pending,
> > .is_active_iterate = vfio_is_active_iterate,
> > .save_live_iterate = vfio_save_iterate,
> > .save_complete = vfio_save_complete_precopy,
> > diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> > index a061aad817..376a9b43ac 100644
> > --- a/migration/block-dirty-bitmap.c
> > +++ b/migration/block-dirty-bitmap.c
> > @@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> > return 0;
> > }
> >
> > -static void dirty_bitmap_state_pending(void *opaque,
> > - uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
> > {
> > DBMSaveState *s = &((DBMState *)opaque)->save;
> > SaveBitmapState *dbms;
> > @@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
> >
> > trace_dirty_bitmap_state_pending(pending);
> >
> > - *can_postcopy += pending;
> > + data->postcopy_bytes += pending;
> > }
> >
> > /* First occurrence of this bitmap. It should be created if doesn't exist */
> > @@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
> > .save_setup = dirty_bitmap_save_setup,
> > .save_complete = dirty_bitmap_save_complete,
> > .has_postcopy = dirty_bitmap_has_postcopy,
> > - .state_pending_exact = dirty_bitmap_state_pending,
> > - .state_pending_estimate = dirty_bitmap_state_pending,
> > + .save_query_pending = dirty_bitmap_state_pending,
> > .save_live_iterate = dirty_bitmap_save_iterate,
> > .is_active_iterate = dirty_bitmap_is_active_iterate,
> > .load_state = dirty_bitmap_load,
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 979751f61b..89f761a471 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> > return qemu_fflush(f);
> > }
> >
> > -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > -{
> > - RAMState **temp = opaque;
> > - RAMState *rs = *temp;
> > -
> > - uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> > -
> > - if (migrate_postcopy_ram()) {
> > - /* We can do postcopy, and all the data is postcopiable */
> > - *can_postcopy += remaining_size;
> > - } else {
> > - *must_precopy += remaining_size;
> > - }
> > -}
> > -
> > -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void ram_state_pending(void *opaque, MigPendingData *pending)
> > {
> > RAMState **temp = opaque;
> > RAMState *rs = *temp;
> > uint64_t remaining_size;
> >
> > - if (!migration_in_postcopy()) {
> > + /*
> > + * Sync is only needed either with: (1) a fast query, or (2) postcopy
> > + * as started (in which case no new dirty will generate anymore).
>
> * as -> has
> * (... no new dirty will generate anymore) - ...?
>
> > + */
> > + if (!pending->fastpath && !migration_in_postcopy()) {
>
> * The comment above and 'if' conditionals don't seem to match. We are
> saying a call to 'migration_bitmap_sync_precopy()' is needed either
> with fast query ie. 'pending->fastpath = true' OR Postcopy has
> started ie. migration_in_postcopy() = true, right? What is the
> right/intended behaviour here?
It should have been s/only/not/. I'll update the comment to:
/*
* Sync is not needed either with: (1) a fast query, or (2) after
* postcopy has started (no new dirty will generate anymore).
*/
>
> > bql_lock();
> > WITH_RCU_READ_LOCK_GUARD() {
> > migration_bitmap_sync_precopy(false);
> > @@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> >
> > if (migrate_postcopy_ram()) {
> > /* We can do postcopy, and all the data is postcopiable */
> > - *can_postcopy += remaining_size;
> > + pending->postcopy_bytes += remaining_size;
> > } else {
> > - *must_precopy += remaining_size;
> > + pending->precopy_bytes += remaining_size;
> > }
> > }
> >
> > @@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
> > .save_live_iterate = ram_save_iterate,
> > .save_complete = ram_save_complete,
> > .has_postcopy = ram_has_postcopy,
> > - .state_pending_exact = ram_state_pending_exact,
> > - .state_pending_estimate = ram_state_pending_estimate,
> > + .save_query_pending = ram_state_pending,
> > .load_state = ram_load,
> > .save_cleanup = ram_save_cleanup,
> > .load_setup = ram_load_setup,
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index dd58f2a705..6268e68382 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> > return qemu_fflush(f);
> > }
> >
> > -/* Give an estimate of the amount left to be transferred,
> > - * the result is split into the amount for units that can and
> > - * for units that can't do postcopy.
> > - */
> > -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> > {
> > SaveStateEntry *se;
> >
> > - *must_precopy = 0;
> > - *can_postcopy = 0;
> > + pending->precopy_bytes = 0;
> > + pending->postcopy_bytes = 0;
> > + pending->fastpath = fastpath;
> >
> > QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> > - if (!se->ops || !se->ops->state_pending_estimate) {
> > + if (!se->ops || !se->ops->save_query_pending) {
> > continue;
> > }
> > if (!qemu_savevm_state_active(se)) {
> > continue;
> > }
> > - se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> > + se->ops->save_query_pending(se->opaque, pending);
> > }
> > }
> >
> > +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> > + uint64_t *can_postcopy)
> > +{
> > + MigPendingData pending;
> > +
> > + qemu_savevm_query_pending(&pending, true);
> > +
> > + *must_precopy = pending.precopy_bytes;
> > + *can_postcopy = pending.postcopy_bytes;
> > +}
> > +
> > void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> > uint64_t *can_postcopy)
> > {
> > - SaveStateEntry *se;
> > + MigPendingData pending;
> >
> > - *must_precopy = 0;
> > - *can_postcopy = 0;
> > + qemu_savevm_query_pending(&pending, false);
> >
> > - QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> > - if (!se->ops || !se->ops->state_pending_exact) {
> > - continue;
> > - }
> > - if (!qemu_savevm_state_active(se)) {
> > - continue;
> > - }
> > - se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> > - }
> > + *must_precopy = pending.precopy_bytes;
> > + *can_postcopy = pending.postcopy_bytes;
> > }
>
> * IIUC, we send fastpath=true from _pending_estimate() because we want
> to avoid a call to migration_bitmap_sync_precopy(), since calling it
> won't be fast. And we send fastpath=false from the _pending_exact()
> function because we want to trigger the call to
> migration_bitmap_sync_precopy().
>
> * Instead of naming the boolean parameter 'fastpath', let's use
> 'bool use_precision' OR 'bool sync_data'. And then in
> qemu_savevm_query_pending()
>
> qemu_savevm_query_pending():
> if (use_precision || sync_data) {
> migration_bitmap_sync_precopy()
> }
>
> * When reading code, equating fastpath with 'estimation' and slowpath
> with 'precision/accuracy' is unnatural.
I'll use "exact" as suggested by Juraj.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
2026-03-24 10:35 ` Prasad Pandit
@ 2026-03-25 15:20 ` Avihai Horon
2026-04-01 21:22 ` Peter Xu
2026-03-27 15:17 ` Juraj Marcin
2 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-03-25 15:20 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
On 3/20/2026 1:12 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> These two APIs are largely duplicates. For example, a few users
> directly pass the same function to both hooks.
>
> Providing two hooks is also slightly error prone, since it makes it
> easier for a module to report different things via the two hooks. In
> reality they should always report the same thing; the only question is
> whether we should use a fast path when the slow path might be too
> slow, even if we have to pay with some loss of accuracy.
>
> Let's just merge it into one API, but instead provide a bool showing if the
> query is a fast query or not.
>
> No functional change intended.
>
> Export qemu_savevm_query_pending(). New users that need to do the
> query should likely use this new API directly. This will happen very
> soon.
>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Jason Herne <jjherne@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Cc: John Snow <jsnow@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/devel/migration/main.rst | 5 ++-
> docs/devel/migration/vfio.rst | 9 ++----
> include/migration/register.h | 52 ++++++++++--------------------
> migration/savevm.h | 3 ++
> hw/s390x/s390-stattrib.c | 8 ++---
> hw/vfio/migration.c | 58 +++++++++++++++++++---------------
> migration/block-dirty-bitmap.c | 9 ++----
> migration/ram.c | 32 ++++++-------------
> migration/savevm.c | 43 ++++++++++++-------------
> hw/vfio/trace-events | 3 +-
> 10 files changed, 93 insertions(+), 129 deletions(-)
>
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 234d280249..22c5910d5c 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -519,9 +519,8 @@ An iterative device must provide:
> data we must save. The core migration code will use this to
> determine when to pause the CPUs and complete the migration.
We should also remove the state_pending_exact section above.
>
> - - A ``state_pending_estimate`` function that indicates how much more
> - data we must save. When the estimated amount is smaller than the
> - threshold, we call ``state_pending_exact``.
> + - A ``save_query_pending`` function that indicates how much more
> + data we must save.
>
> - A ``save_live_iterate`` function should send a chunk of data until
> the point that stream bandwidth limits tell it to stop. Each call
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index 0790e5031d..33768c877c 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> * A ``load_setup`` function that sets the VFIO device on the destination in
> _RESUMING state.
>
> -* A ``state_pending_estimate`` function that reports an estimate of the
> - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> - device.
> -
> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> - driver, which indicates the amount of data that the vendor driver has yet to
> - save for the VFIO device.
> +* A ``save_query_pending`` function that reports the remaining pre-copy
> + data that the vendor driver has yet to save for the VFIO device.
>
> * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> active only when the VFIO device is in pre-copy states.
> diff --git a/include/migration/register.h b/include/migration/register.h
> index d0f37f5f43..2320c3a981 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -16,6 +16,15 @@
>
> #include "hw/core/vmstate-if.h"
>
> +typedef struct MigPendingData {
> + /* How many bytes are pending for precopy / stopcopy? */
> + uint64_t precopy_bytes;
> + /* How many bytes are pending that can be transferred in postcopy? */
> + uint64_t postcopy_bytes;
> + /* Is this a fastpath query (which can be inaccurate)? */
> + bool fastpath;
> +} MigPendingData ;
Extra space before semicolon.
> +
> /**
> * struct SaveVMHandlers: handler structure to finely control
> * migration of complex subsystems and devices, such as RAM, block and
> @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
> bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>
> /**
> - * @state_pending_estimate
> - *
> - * This estimates the remaining data to transfer
> - *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> - *
> - * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> - */
> - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> -
> - /**
> - * @state_pending_exact
> - *
> - * This calculates the exact remaining data to transfer
> + * @save_query_pending
> *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> + * This estimates the remaining data to transfer on the source side.
> + * It's highly suggested that the module should implement both fastpath
> + * and slowpath version of it when it can be slow (for more information
> + * please check pending->fastpath field).
> *
> * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> + * @pending: pointer to a MigPendingData struct
> */
> - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> + void (*save_query_pending)(void *opaque, MigPendingData *pending);
>
> /**
> * @load_state
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b3d1e8a13c..b116933bce 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -14,6 +14,8 @@
> #ifndef MIGRATION_SAVEVM_H
> #define MIGRATION_SAVEVM_H
>
> +#include "migration/register.h"
> +
> #define QEMU_VM_FILE_MAGIC 0x5145564d
> #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> #define QEMU_VM_FILE_VERSION 0x00000003
> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> void qemu_savevm_state_cleanup(void);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy);
> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..b1ec51c77a 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> return 0;
> }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending)
> {
> S390StAttribState *sas = S390_STATTRIB(opaque);
> S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> long long res = sac->get_dirtycount(sas);
>
> if (res >= 0) {
> - *must_precopy += res;
> + pending->precopy_bytes += res;
> }
> }
>
> @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> .save_setup = cmma_save_setup,
> .save_live_iterate = cmma_save_iterate,
> .save_complete = cmma_save_complete,
> - .state_pending_exact = cmma_state_pending,
> - .state_pending_estimate = cmma_state_pending,
> + .save_query_pending = cmma_state_pending,
> .save_cleanup = cmma_save_cleanup,
> .load_state = cmma_load,
> .is_active = cmma_active,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 827d3ded63..c054c749b0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
> trace_vfio_save_cleanup(vbasedev->name);
> }
>
> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> {
> - VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
>
> - if (!vfio_device_state_is_precopy(vbasedev)) {
> - return;
> - }
> + vfio_query_stop_copy_size(vbasedev);
>
> - *must_precopy +=
> - migration->precopy_init_size + migration->precopy_dirty_size;
> + if (vfio_device_state_is_precopy(vbasedev)) {
> + vfio_query_precopy_size(migration);
> + }
>
> - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> - *can_postcopy,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + /*
> + * In all cases, all PRECOPY data should be no more than STOPCOPY data.
> + * Otherwise we have a problem. So far, only dump some errors.
> + */
I'm not sure that must be true, as stopcopy and precopy queries are not
atomic, so device state size can change between the queries.
Thanks.
> + if (migration->precopy_init_size + migration->precopy_dirty_size <
> + migration->stopcopy_size) {
> + error_report_once("%s: wrong pending data (init=%" PRIx64
> + ", dirty=%"PRIx64", stop=%"PRIx64")",
> + __func__, migration->precopy_init_size,
> + migration->precopy_dirty_size,
> + migration->stopcopy_size);
> + }
> }
>
> -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending(void *opaque, MigPendingData *pending)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> + uint64_t remain;
>
> - vfio_query_stop_copy_size(vbasedev);
> - *must_precopy += migration->stopcopy_size;
> -
> - if (vfio_device_state_is_precopy(vbasedev)) {
> - vfio_query_precopy_size(migration);
> + if (pending->fastpath) {
> + if (!vfio_device_state_is_precopy(vbasedev)) {
> + return;
> + }
> + remain = migration->precopy_init_size + migration->precopy_dirty_size;
> + } else {
> + vfio_state_pending_sync(vbasedev);
> + remain = migration->stopcopy_size;
> }
>
> - trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> - migration->stopcopy_size,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + pending->precopy_bytes += remain;
> +
> + trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> + migration->precopy_init_size,
> + migration->precopy_dirty_size);
> }
>
> static bool vfio_is_active_iterate(void *opaque)
> @@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .save_prepare = vfio_save_prepare,
> .save_setup = vfio_save_setup,
> .save_cleanup = vfio_save_cleanup,
> - .state_pending_estimate = vfio_state_pending_estimate,
> - .state_pending_exact = vfio_state_pending_exact,
> + .save_query_pending = vfio_state_pending,
> .is_active_iterate = vfio_is_active_iterate,
> .save_live_iterate = vfio_save_iterate,
> .save_complete = vfio_save_complete_precopy,
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index a061aad817..376a9b43ac 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> return 0;
> }
>
> -static void dirty_bitmap_state_pending(void *opaque,
> - uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
> {
> DBMSaveState *s = &((DBMState *)opaque)->save;
> SaveBitmapState *dbms;
> @@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
>
> trace_dirty_bitmap_state_pending(pending);
>
> - *can_postcopy += pending;
> + data->postcopy_bytes += pending;
> }
>
> /* First occurrence of this bitmap. It should be created if doesn't exist */
> @@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
> .save_setup = dirty_bitmap_save_setup,
> .save_complete = dirty_bitmap_save_complete,
> .has_postcopy = dirty_bitmap_has_postcopy,
> - .state_pending_exact = dirty_bitmap_state_pending,
> - .state_pending_estimate = dirty_bitmap_state_pending,
> + .save_query_pending = dirty_bitmap_state_pending,
> .save_live_iterate = dirty_bitmap_save_iterate,
> .is_active_iterate = dirty_bitmap_is_active_iterate,
> .load_state = dirty_bitmap_load,
> diff --git a/migration/ram.c b/migration/ram.c
> index 979751f61b..89f761a471 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> return qemu_fflush(f);
> }
>
> -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - RAMState **temp = opaque;
> - RAMState *rs = *temp;
> -
> - uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -
> - if (migrate_postcopy_ram()) {
> - /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> - } else {
> - *must_precopy += remaining_size;
> - }
> -}
> -
> -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void ram_state_pending(void *opaque, MigPendingData *pending)
> {
> RAMState **temp = opaque;
> RAMState *rs = *temp;
> uint64_t remaining_size;
>
> - if (!migration_in_postcopy()) {
> + /*
> + * Sync is only needed either with: (1) a fast query, or (2) postcopy
> + * as started (in which case no new dirty will generate anymore).
> + */
> + if (!pending->fastpath && !migration_in_postcopy()) {
> bql_lock();
> WITH_RCU_READ_LOCK_GUARD() {
> migration_bitmap_sync_precopy(false);
> @@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
>
> if (migrate_postcopy_ram()) {
> /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> + pending->postcopy_bytes += remaining_size;
> } else {
> - *must_precopy += remaining_size;
> + pending->precopy_bytes += remaining_size;
> }
> }
>
> @@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
> .save_live_iterate = ram_save_iterate,
> .save_complete = ram_save_complete,
> .has_postcopy = ram_has_postcopy,
> - .state_pending_exact = ram_state_pending_exact,
> - .state_pending_estimate = ram_state_pending_estimate,
> + .save_query_pending = ram_state_pending,
> .load_state = ram_load,
> .save_cleanup = ram_save_cleanup,
> .load_setup = ram_load_setup,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index dd58f2a705..6268e68382 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> return qemu_fflush(f);
> }
>
> -/* Give an estimate of the amount left to be transferred,
> - * the result is split into the amount for units that can and
> - * for units that can't do postcopy.
> - */
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> {
> SaveStateEntry *se;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + pending->precopy_bytes = 0;
> + pending->postcopy_bytes = 0;
> + pending->fastpath = fastpath;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_estimate) {
> + if (!se->ops || !se->ops->save_query_pending) {
> continue;
> }
> if (!qemu_savevm_state_active(se)) {
> continue;
> }
> - se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> + se->ops->save_query_pending(se->opaque, pending);
> }
> }
>
> +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> + uint64_t *can_postcopy)
> +{
> + MigPendingData pending;
> +
> + qemu_savevm_query_pending(&pending, true);
> +
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> +}
> +
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> - SaveStateEntry *se;
> + MigPendingData pending;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + qemu_savevm_query_pending(&pending, false);
>
> - QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_exact) {
> - continue;
> - }
> - if (!qemu_savevm_state_active(se)) {
> - continue;
> - }
> - se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> - }
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> }
>
> void qemu_savevm_state_cleanup(void)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 846e3625c5..7cf5a9eb2d 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_save_iterate_start(const char *name) " (%s)"
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
> -vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> +vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-03-25 15:20 ` Avihai Horon
@ 2026-04-01 21:22 ` Peter Xu
2026-04-06 11:54 ` Avihai Horon
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-01 21:22 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
On Wed, Mar 25, 2026 at 05:20:41PM +0200, Avihai Horon wrote:
>
> On 3/20/2026 1:12 AM, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > These two APIs are a slight duplication. For example, there're a few users
> > that directly pass in the same function.
> >
> > It might also be slightly error prone to provide two hooks, so that it's
> > easier to happen that one module report different things via the two
> > hooks. In reality they should always report the same thing, only about
> > whether we should use a fast-path when the slow path might be too slow, and
> > even if we need to pay with some less accuracy.
> >
> > Let's just merge it into one API, but instead provide a bool showing if the
> > query is a fast query or not.
> >
> > No functional change intended.
> >
> > Export qemu_savevm_query_pending(). We should likely directly use the new
> > API here provided when there're new users to do the query. This will
> > happen very soon.
> >
> > Cc: Halil Pasic <pasic@linux.ibm.com>
> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> > Cc: Jason Herne <jjherne@linux.ibm.com>
> > Cc: Eric Farman <farman@linux.ibm.com>
> > Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> > Cc: Richard Henderson <richard.henderson@linaro.org>
> > Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Cornelia Huck <cohuck@redhat.com>
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> > Cc: John Snow <jsnow@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > docs/devel/migration/main.rst | 5 ++-
> > docs/devel/migration/vfio.rst | 9 ++----
> > include/migration/register.h | 52 ++++++++++--------------------
> > migration/savevm.h | 3 ++
> > hw/s390x/s390-stattrib.c | 8 ++---
> > hw/vfio/migration.c | 58 +++++++++++++++++++---------------
> > migration/block-dirty-bitmap.c | 9 ++----
> > migration/ram.c | 32 ++++++-------------
> > migration/savevm.c | 43 ++++++++++++-------------
> > hw/vfio/trace-events | 3 +-
> > 10 files changed, 93 insertions(+), 129 deletions(-)
> >
> > diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> > index 234d280249..22c5910d5c 100644
> > --- a/docs/devel/migration/main.rst
> > +++ b/docs/devel/migration/main.rst
> > @@ -519,9 +519,8 @@ An iterative device must provide:
> > data we must save. The core migration code will use this to
> > determine when to pause the CPUs and complete the migration.
>
> We should also remove the state_pending_exact section above.
Ah yes, will do.
>
> >
> > - - A ``state_pending_estimate`` function that indicates how much more
> > - data we must save. When the estimated amount is smaller than the
> > - threshold, we call ``state_pending_exact``.
> > + - A ``save_query_pending`` function that indicates how much more
> > + data we must save.
> >
> > - A ``save_live_iterate`` function should send a chunk of data until
> > the point that stream bandwidth limits tell it to stop. Each call
> > diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> > index 0790e5031d..33768c877c 100644
> > --- a/docs/devel/migration/vfio.rst
> > +++ b/docs/devel/migration/vfio.rst
> > @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> > * A ``load_setup`` function that sets the VFIO device on the destination in
> > _RESUMING state.
> >
> > -* A ``state_pending_estimate`` function that reports an estimate of the
> > - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> > - device.
> > -
> > -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> > - driver, which indicates the amount of data that the vendor driver has yet to
> > - save for the VFIO device.
> > +* A ``save_query_pending`` function that reports the remaining pre-copy
> > + data that the vendor driver has yet to save for the VFIO device.
> >
> > * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> > active only when the VFIO device is in pre-copy states.
> > diff --git a/include/migration/register.h b/include/migration/register.h
> > index d0f37f5f43..2320c3a981 100644
> > --- a/include/migration/register.h
> > +++ b/include/migration/register.h
> > @@ -16,6 +16,15 @@
> >
> > #include "hw/core/vmstate-if.h"
> >
> > +typedef struct MigPendingData {
> > + /* How many bytes are pending for precopy / stopcopy? */
> > + uint64_t precopy_bytes;
> > + /* How many bytes are pending that can be transferred in postcopy? */
> > + uint64_t postcopy_bytes;
> > + /* Is this a fastpath query (which can be inaccurate)? */
> > + bool fastpath;
> > +} MigPendingData ;
>
> Extra space before semicolon.
Fixed.
>
> > +
> > /**
> > * struct SaveVMHandlers: handler structure to finely control
> > * migration of complex subsystems and devices, such as RAM, block and
> > @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
> > bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
> >
> > /**
> > - * @state_pending_estimate
> > - *
> > - * This estimates the remaining data to transfer
> > - *
> > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > - * pending data.
> > - *
> > - * @opaque: data pointer passed to register_savevm_live()
> > - * @must_precopy: amount of data that must be migrated in precopy
> > - * or in stopped state, i.e. that must be migrated
> > - * before target start.
> > - * @can_postcopy: amount of data that can be migrated in postcopy
> > - * or in stopped state, i.e. after target start.
> > - * Some can also be migrated during precopy (RAM).
> > - * Some must be migrated after source stops
> > - * (block-dirty-bitmap)
> > - */
> > - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy);
> > -
> > - /**
> > - * @state_pending_exact
> > - *
> > - * This calculates the exact remaining data to transfer
> > + * @save_query_pending
> > *
> > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > - * pending data.
> > + * This estimates the remaining data to transfer on the source side.
> > + * It's highly suggested that the module should implement both fastpath
> > + * and slowpath version of it when it can be slow (for more information
> > + * please check pending->fastpath field).
> > *
> > * @opaque: data pointer passed to register_savevm_live()
> > - * @must_precopy: amount of data that must be migrated in precopy
> > - * or in stopped state, i.e. that must be migrated
> > - * before target start.
> > - * @can_postcopy: amount of data that can be migrated in postcopy
> > - * or in stopped state, i.e. after target start.
> > - * Some can also be migrated during precopy (RAM).
> > - * Some must be migrated after source stops
> > - * (block-dirty-bitmap)
> > + * @pending: pointer to a MigPendingData struct
> > */
> > - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy);
> > + void (*save_query_pending)(void *opaque, MigPendingData *pending);
> >
> > /**
> > * @load_state
> > diff --git a/migration/savevm.h b/migration/savevm.h
> > index b3d1e8a13c..b116933bce 100644
> > --- a/migration/savevm.h
> > +++ b/migration/savevm.h
> > @@ -14,6 +14,8 @@
> > #ifndef MIGRATION_SAVEVM_H
> > #define MIGRATION_SAVEVM_H
> >
> > +#include "migration/register.h"
> > +
> > #define QEMU_VM_FILE_MAGIC 0x5145564d
> > #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> > #define QEMU_VM_FILE_VERSION 0x00000003
> > @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> > void qemu_savevm_state_cleanup(void);
> > void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> > int qemu_savevm_state_complete_precopy(MigrationState *s);
> > +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> > void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> > uint64_t *can_postcopy);
> > void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> > diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> > index d808ece3b9..b1ec51c77a 100644
> > --- a/hw/s390x/s390-stattrib.c
> > +++ b/hw/s390x/s390-stattrib.c
> > @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> > return 0;
> > }
> >
> > -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void cmma_state_pending(void *opaque, MigPendingData *pending)
> > {
> > S390StAttribState *sas = S390_STATTRIB(opaque);
> > S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> > long long res = sac->get_dirtycount(sas);
> >
> > if (res >= 0) {
> > - *must_precopy += res;
> > + pending->precopy_bytes += res;
> > }
> > }
> >
> > @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> > .save_setup = cmma_save_setup,
> > .save_live_iterate = cmma_save_iterate,
> > .save_complete = cmma_save_complete,
> > - .state_pending_exact = cmma_state_pending,
> > - .state_pending_estimate = cmma_state_pending,
> > + .save_query_pending = cmma_state_pending,
> > .save_cleanup = cmma_save_cleanup,
> > .load_state = cmma_load,
> > .is_active = cmma_active,
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 827d3ded63..c054c749b0 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
> > trace_vfio_save_cleanup(vbasedev->name);
> > }
> >
> > -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> > {
> > - VFIODevice *vbasedev = opaque;
> > VFIOMigration *migration = vbasedev->migration;
> >
> > - if (!vfio_device_state_is_precopy(vbasedev)) {
> > - return;
> > - }
> > + vfio_query_stop_copy_size(vbasedev);
> >
> > - *must_precopy +=
> > - migration->precopy_init_size + migration->precopy_dirty_size;
> > + if (vfio_device_state_is_precopy(vbasedev)) {
> > + vfio_query_precopy_size(migration);
> > + }
> >
> > - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> > - *can_postcopy,
> > - migration->precopy_init_size,
> > - migration->precopy_dirty_size);
> > + /*
> > + * In all cases, all PRECOPY data should be no more than STOPCOPY data.
> > + * Otherwise we have a problem. So far, only dump some errors.
> > + */
>
> I'm not sure that must be true, as stopcopy and precopy queries are not
> atomic, so device state size can change between the queries.
I believe I wrote this based only on my (apparently wrong) understanding of
stopcopy size, which I thought was an upper bound. It looks like it is not.
If so, yes, atomicity will be a problem..
Is the below all the documentation I can find regarding this ioctl? vfio.h has:
/*
* Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
* be required to complete stop copy.
*
* Note: Can be called on each device state.
*/
struct vfio_device_feature_mig_data_size {
__aligned_u64 stop_copy_length;
};
#define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE 9
I understand all driver authors want to be flexible, but I wonder if I'm
the only person who keeps getting confused by the VFIO migration API
providing estimated results all over the place, and it will definitely be
more confusing if the estimate can shrink. We can leave this part of the
discussion to the other thread.
I don't think this check is a must; I can simply remove it. However, please
help me understand this interface better.
Another note: the goal of this series is to provide some way for the mgmt
layer to query pending information when VFIO is involved during migration.
This series only exposes a (hopefully much more accurate) expected_downtime.
When I repost, I plan to add a system-wide remaining-data field that the
mgmt layer can fetch on each iteration to predict possible non-convergence:
for example, query the remaining data, and if it stops shrinking for a while
that is a simple sign the migration may not be converging. It means either
the downtime needs adjustment or something else does (cancellation is also
an option).
Avihai, if this goal is also part of what you or your team will be looking
for, please let me know. I am always open to you or someone from your team
taking this over if that is easier, as you know this area better. Otherwise
I'll respin after I figure out the remaining puzzles with your help.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-04-01 21:22 ` Peter Xu
@ 2026-04-06 11:54 ` Avihai Horon
0 siblings, 0 replies; 67+ messages in thread
From: Avihai Horon @ 2026-04-06 11:54 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater, Halil Pasic,
Christian Borntraeger, Jason Herne, Eric Farman, Matthew Rosato,
Richard Henderson, Ilya Leoshkevich, David Hildenbrand,
Cornelia Huck, Eric Blake, Vladimir Sementsov-Ogievskiy,
John Snow
On 4/2/2026 12:22 AM, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Wed, Mar 25, 2026 at 05:20:41PM +0200, Avihai Horon wrote:
>> On 3/20/2026 1:12 AM, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> These two APIs are a slight duplication. For example, there're a few users
>>> that directly pass in the same function.
>>>
>>> It might also be slightly error prone to provide two hooks, so that it's
>>> easier to happen that one module report different things via the two
>>> hooks. In reality they should always report the same thing, only about
>>> whether we should use a fast-path when the slow path might be too slow, and
>>> even if we need to pay with some less accuracy.
>>>
>>> Let's just merge it into one API, but instead provide a bool showing if the
>>> query is a fast query or not.
>>>
>>> No functional change intended.
>>>
>>> Export qemu_savevm_query_pending(). We should likely directly use the new
>>> API here provided when there're new users to do the query. This will
>>> happen very soon.
>>>
>>> Cc: Halil Pasic <pasic@linux.ibm.com>
>>> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
>>> Cc: Jason Herne <jjherne@linux.ibm.com>
>>> Cc: Eric Farman <farman@linux.ibm.com>
>>> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
>>> Cc: Richard Henderson <richard.henderson@linaro.org>
>>> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
>>> Cc: David Hildenbrand <david@kernel.org>
>>> Cc: Cornelia Huck <cohuck@redhat.com>
>>> Cc: Eric Blake <eblake@redhat.com>
>>> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>>> Cc: John Snow <jsnow@redhat.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> docs/devel/migration/main.rst | 5 ++-
>>> docs/devel/migration/vfio.rst | 9 ++----
>>> include/migration/register.h | 52 ++++++++++--------------------
>>> migration/savevm.h | 3 ++
>>> hw/s390x/s390-stattrib.c | 8 ++---
>>> hw/vfio/migration.c | 58 +++++++++++++++++++---------------
>>> migration/block-dirty-bitmap.c | 9 ++----
>>> migration/ram.c | 32 ++++++-------------
>>> migration/savevm.c | 43 ++++++++++++-------------
>>> hw/vfio/trace-events | 3 +-
>>> 10 files changed, 93 insertions(+), 129 deletions(-)
>>>
>>> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
>>> index 234d280249..22c5910d5c 100644
>>> --- a/docs/devel/migration/main.rst
>>> +++ b/docs/devel/migration/main.rst
>>> @@ -519,9 +519,8 @@ An iterative device must provide:
>>> data we must save. The core migration code will use this to
>>> determine when to pause the CPUs and complete the migration.
>> We should also remove the state_pending_exact section above.
> Ah yes, will do.
>
>>> - - A ``state_pending_estimate`` function that indicates how much more
>>> - data we must save. When the estimated amount is smaller than the
>>> - threshold, we call ``state_pending_exact``.
>>> + - A ``save_query_pending`` function that indicates how much more
>>> + data we must save.
>>>
>>> - A ``save_live_iterate`` function should send a chunk of data until
>>> the point that stream bandwidth limits tell it to stop. Each call
>>> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
>>> index 0790e5031d..33768c877c 100644
>>> --- a/docs/devel/migration/vfio.rst
>>> +++ b/docs/devel/migration/vfio.rst
>>> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
>>> * A ``load_setup`` function that sets the VFIO device on the destination in
>>> _RESUMING state.
>>>
>>> -* A ``state_pending_estimate`` function that reports an estimate of the
>>> - remaining pre-copy data that the vendor driver has yet to save for the VFIO
>>> - device.
>>> -
>>> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
>>> - driver, which indicates the amount of data that the vendor driver has yet to
>>> - save for the VFIO device.
>>> +* A ``save_query_pending`` function that reports the remaining pre-copy
>>> + data that the vendor driver has yet to save for the VFIO device.
>>>
>>> * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>>> active only when the VFIO device is in pre-copy states.
>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>> index d0f37f5f43..2320c3a981 100644
>>> --- a/include/migration/register.h
>>> +++ b/include/migration/register.h
>>> @@ -16,6 +16,15 @@
>>>
>>> #include "hw/core/vmstate-if.h"
>>>
>>> +typedef struct MigPendingData {
>>> + /* How many bytes are pending for precopy / stopcopy? */
>>> + uint64_t precopy_bytes;
>>> + /* How many bytes are pending that can be transferred in postcopy? */
>>> + uint64_t postcopy_bytes;
>>> + /* Is this a fastpath query (which can be inaccurate)? */
>>> + bool fastpath;
>>> +} MigPendingData ;
>> Extra space before semicolon.
> Fixed.
>
>>> +
>>> /**
>>> * struct SaveVMHandlers: handler structure to finely control
>>> * migration of complex subsystems and devices, such as RAM, block and
>>> @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
>>> bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>>>
>>> /**
>>> - * @state_pending_estimate
>>> - *
>>> - * This estimates the remaining data to transfer
>>> - *
>>> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
>>> - * pending data.
>>> - *
>>> - * @opaque: data pointer passed to register_savevm_live()
>>> - * @must_precopy: amount of data that must be migrated in precopy
>>> - * or in stopped state, i.e. that must be migrated
>>> - * before target start.
>>> - * @can_postcopy: amount of data that can be migrated in postcopy
>>> - * or in stopped state, i.e. after target start.
>>> - * Some can also be migrated during precopy (RAM).
>>> - * Some must be migrated after source stops
>>> - * (block-dirty-bitmap)
>>> - */
>>> - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
>>> - uint64_t *can_postcopy);
>>> -
>>> - /**
>>> - * @state_pending_exact
>>> - *
>>> - * This calculates the exact remaining data to transfer
>>> + * @save_query_pending
>>> *
>>> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
>>> - * pending data.
>>> + * This estimates the remaining data to transfer on the source side.
>>> + * It's highly suggested that the module should implement both fastpath
>>> + * and slowpath version of it when it can be slow (for more information
>>> + * please check pending->fastpath field).
>>> *
>>> * @opaque: data pointer passed to register_savevm_live()
>>> - * @must_precopy: amount of data that must be migrated in precopy
>>> - * or in stopped state, i.e. that must be migrated
>>> - * before target start.
>>> - * @can_postcopy: amount of data that can be migrated in postcopy
>>> - * or in stopped state, i.e. after target start.
>>> - * Some can also be migrated during precopy (RAM).
>>> - * Some must be migrated after source stops
>>> - * (block-dirty-bitmap)
>>> + * @pending: pointer to a MigPendingData struct
>>> */
>>> - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
>>> - uint64_t *can_postcopy);
>>> + void (*save_query_pending)(void *opaque, MigPendingData *pending);
>>>
>>> /**
>>> * @load_state
>>> diff --git a/migration/savevm.h b/migration/savevm.h
>>> index b3d1e8a13c..b116933bce 100644
>>> --- a/migration/savevm.h
>>> +++ b/migration/savevm.h
>>> @@ -14,6 +14,8 @@
>>> #ifndef MIGRATION_SAVEVM_H
>>> #define MIGRATION_SAVEVM_H
>>>
>>> +#include "migration/register.h"
>>> +
>>> #define QEMU_VM_FILE_MAGIC 0x5145564d
>>> #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
>>> #define QEMU_VM_FILE_VERSION 0x00000003
>>> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
>>> void qemu_savevm_state_cleanup(void);
>>> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>>> int qemu_savevm_state_complete_precopy(MigrationState *s);
>>> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
>>> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
>>> uint64_t *can_postcopy);
>>> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
>>> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
>>> index d808ece3b9..b1ec51c77a 100644
>>> --- a/hw/s390x/s390-stattrib.c
>>> +++ b/hw/s390x/s390-stattrib.c
>>> @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
>>> return 0;
>>> }
>>>
>>> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
>>> - uint64_t *can_postcopy)
>>> +static void cmma_state_pending(void *opaque, MigPendingData *pending)
>>> {
>>> S390StAttribState *sas = S390_STATTRIB(opaque);
>>> S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
>>> long long res = sac->get_dirtycount(sas);
>>>
>>> if (res >= 0) {
>>> - *must_precopy += res;
>>> + pending->precopy_bytes += res;
>>> }
>>> }
>>>
>>> @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
>>> .save_setup = cmma_save_setup,
>>> .save_live_iterate = cmma_save_iterate,
>>> .save_complete = cmma_save_complete,
>>> - .state_pending_exact = cmma_state_pending,
>>> - .state_pending_estimate = cmma_state_pending,
>>> + .save_query_pending = cmma_state_pending,
>>> .save_cleanup = cmma_save_cleanup,
>>> .load_state = cmma_load,
>>> .is_active = cmma_active,
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 827d3ded63..c054c749b0 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
>>> trace_vfio_save_cleanup(vbasedev->name);
>>> }
>>>
>>> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>>> - uint64_t *can_postcopy)
>>> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
>>> {
>>> - VFIODevice *vbasedev = opaque;
>>> VFIOMigration *migration = vbasedev->migration;
>>>
>>> - if (!vfio_device_state_is_precopy(vbasedev)) {
>>> - return;
>>> - }
>>> + vfio_query_stop_copy_size(vbasedev);
>>>
>>> - *must_precopy +=
>>> - migration->precopy_init_size + migration->precopy_dirty_size;
>>> + if (vfio_device_state_is_precopy(vbasedev)) {
>>> + vfio_query_precopy_size(migration);
>>> + }
>>>
>>> - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>>> - *can_postcopy,
>>> - migration->precopy_init_size,
>>> - migration->precopy_dirty_size);
>>> + /*
>>> + * In all cases, all PRECOPY data should be no more than STOPCOPY data.
>>> + * Otherwise we have a problem. So far, only dump some errors.
>>> + */
>> I'm not sure that must be true, as stopcopy and precopy queries are not
>> atomic, so device state size can change between the queries.
> I believe I wrote this only based on my (proven wrong) understanding of
> stopcopy size, which I thought was an upper bound. It looks not.. If so,
> yes, atomicity will be a problem..
I think you can view it as an upper bound, in the sense that if you
issued the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE and VFIO_MIG_GET_PRECOPY_INFO
ioctls at the exact same time, you should see stopcopy size >= precopy
size. But that's not the case here, because they are not called at the
same time.
> Is below all the doc I can find regarding to this ioctl? vfio.h has:
>
> /*
> * Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
> * be required to complete stop copy.
> *
> * Note: Can be called on each device state.
> */
>
> struct vfio_device_feature_mig_data_size {
> __aligned_u64 stop_copy_length;
> };
>
> #define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE 9
>
> I understand all driver authors want to be flexible, but I wonder if I'm
> the only person keep getting confused by the VFIO migration API on
> providing estimated results all over the places, and it'll be definitely
> more confusing if it can shrink. We can leave this part of discussion in
> the other thread.
>
> I don't think this check is a must; I can simply remove it, however please
> help me to understand better on this interface.
Yes, those are all the docs.
There isn't much to add: as the docs say, it returns an estimate of the
amount of data remaining to complete the migration for the VFIO device.
IIRC, it was introduced as a complement to VFIO_MIG_GET_PRECOPY_INFO,
because not all VFIO devices support precopy or have all their data
precopy-able.
>
> Another note is that the goal of this series is to provide any way so the
> mgmt can query pending information when VFIO is involved during migration.
>
> This series only exposed (hopefully, much more accurate) expected_downtime.
>
> When repost, I plan to add a system-wise remaining data field so the mgmt
> can fetch for each of the iterations to predict any possible unconvergence,
> for example, query remaining data and if it stops shrinking for a while it
> is a simple sign of possible not converging. It means either downtime
> needs adjustments or something else (cancellation is also an option).
>
> Avihai, if you think this goal will also be part of what you or your team
> will be looking for, please let me know. I am always open if you or
> someone from your team thinks it easier to take this over as you know
> better on this area. Otherwise I'll respin after I figure out the rest
> puzzles with your help.
I think this series is good; we just need to sort out the subtle VFIO parts.
Thanks.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
2026-03-24 10:35 ` Prasad Pandit
2026-03-25 15:20 ` Avihai Horon
@ 2026-03-27 15:17 ` Juraj Marcin
2 siblings, 0 replies; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 15:17 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Halil Pasic, Christian Borntraeger, Jason Herne, Eric Farman,
Matthew Rosato, Richard Henderson, Ilya Leoshkevich,
David Hildenbrand, Cornelia Huck, Eric Blake,
Vladimir Sementsov-Ogievskiy, John Snow
On 2026-03-19 19:12, Peter Xu wrote:
> These two APIs largely duplicate each other. For example, a few users
> directly register the same function for both hooks.
>
> Providing two hooks is also slightly error prone, since it makes it easier
> for one module to report different things via the two hooks. In reality
> they should always report the same thing; the only difference is whether
> we take a fast path when the slow path might be too slow, even at the cost
> of some accuracy.
>
> Let's just merge them into one API, and instead provide a bool showing
> whether the query is a fast query or not.
>
> No functional change intended.
>
> Export qemu_savevm_query_pending(). New users that need to do the query
> should use this new API directly. Such users will appear very soon.
>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Jason Herne <jjherne@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Cc: John Snow <jsnow@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> docs/devel/migration/main.rst | 5 ++-
> docs/devel/migration/vfio.rst | 9 ++----
> include/migration/register.h | 52 ++++++++++--------------------
> migration/savevm.h | 3 ++
> hw/s390x/s390-stattrib.c | 8 ++---
> hw/vfio/migration.c | 58 +++++++++++++++++++---------------
> migration/block-dirty-bitmap.c | 9 ++----
> migration/ram.c | 32 ++++++-------------
> migration/savevm.c | 43 ++++++++++++-------------
> hw/vfio/trace-events | 3 +-
> 10 files changed, 93 insertions(+), 129 deletions(-)
>
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 234d280249..22c5910d5c 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -519,9 +519,8 @@ An iterative device must provide:
> data we must save. The core migration code will use this to
> determine when to pause the CPUs and complete the migration.
>
> - - A ``state_pending_estimate`` function that indicates how much more
> - data we must save. When the estimated amount is smaller than the
> - threshold, we call ``state_pending_exact``.
> + - A ``save_query_pending`` function that indicates how much more
> + data we must save.
>
> - A ``save_live_iterate`` function should send a chunk of data until
> the point that stream bandwidth limits tell it to stop. Each call
> diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> index 0790e5031d..33768c877c 100644
> --- a/docs/devel/migration/vfio.rst
> +++ b/docs/devel/migration/vfio.rst
> @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> * A ``load_setup`` function that sets the VFIO device on the destination in
> _RESUMING state.
>
> -* A ``state_pending_estimate`` function that reports an estimate of the
> - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> - device.
> -
> -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> - driver, which indicates the amount of data that the vendor driver has yet to
> - save for the VFIO device.
> +* A ``save_query_pending`` function that reports the remaining pre-copy
> + data that the vendor driver has yet to save for the VFIO device.
>
> * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> active only when the VFIO device is in pre-copy states.
> diff --git a/include/migration/register.h b/include/migration/register.h
> index d0f37f5f43..2320c3a981 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -16,6 +16,15 @@
>
> #include "hw/core/vmstate-if.h"
>
> +typedef struct MigPendingData {
> + /* How many bytes are pending for precopy / stopcopy? */
> + uint64_t precopy_bytes;
> + /* How many bytes are pending that can be transferred in postcopy? */
> + uint64_t postcopy_bytes;
> + /* Is this a fastpath query (which can be inaccurate)? */
> + bool fastpath;
> +} MigPendingData ;
I agree with Prasad that the name fastpath seems unnatural; I think using
estimate (or exact, with the conditions inverted) would be better.
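For readers skimming the thread, here is a minimal standalone sketch of the merged-hook pattern this patch introduces. The `MigPendingData` layout is copied from the diff; `DemoDev`, `demo_query_pending` and `demo_pending` are hypothetical stand-ins for a real device's state and hook, not QEMU code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mirrors the MigPendingData struct added in the diff above. */
typedef struct MigPendingData {
    uint64_t precopy_bytes;   /* must be sent before the target starts */
    uint64_t postcopy_bytes;  /* can be sent after the target starts */
    bool fastpath;            /* true: cheap, possibly stale estimate */
} MigPendingData;

/* Hypothetical device state standing in for e.g. VFIOMigration. */
typedef struct DemoDev {
    uint64_t dirty_estimate;  /* cheaply tracked running counter */
    uint64_t exact_remaining; /* expensive to compute (bitmap sync etc.) */
} DemoDev;

/* One hook serves both the old _estimate and _exact roles. */
void demo_query_pending(void *opaque, MigPendingData *pending)
{
    DemoDev *dev = opaque;

    if (pending->fastpath) {
        /* Fast path: report the cached estimate, skip the costly sync. */
        pending->precopy_bytes += dev->dirty_estimate;
    } else {
        /* Slow path: pay for an accurate answer. */
        pending->precopy_bytes += dev->exact_remaining;
    }
}

/* Caller-side helper, mimicking what qemu_savevm_query_pending() does. */
uint64_t demo_pending(DemoDev *dev, bool fastpath)
{
    MigPendingData pending = { .fastpath = fastpath };

    demo_query_pending(dev, &pending);
    return pending.precopy_bytes + pending.postcopy_bytes;
}
```

The point of the design is that a device that has no cheap estimate can simply ignore `pending->fastpath` and always report the same number, which is exactly what the converted s390 and dirty-bitmap handlers do.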
> +
> /**
> * struct SaveVMHandlers: handler structure to finely control
> * migration of complex subsystems and devices, such as RAM, block and
> @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
> bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
>
> /**
> - * @state_pending_estimate
> - *
> - * This estimates the remaining data to transfer
> - *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> - *
> - * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> - */
> - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> -
> - /**
> - * @state_pending_exact
> - *
> - * This calculates the exact remaining data to transfer
> + * @save_query_pending
> *
> - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> - * pending data.
> + * This estimates the remaining data to transfer on the source side.
> + * It is highly suggested that the module implement both fastpath and
> + * slowpath versions of this query when it can be slow (for more
> + * information please check the pending->fastpath field).
> *
> * @opaque: data pointer passed to register_savevm_live()
> - * @must_precopy: amount of data that must be migrated in precopy
> - * or in stopped state, i.e. that must be migrated
> - * before target start.
> - * @can_postcopy: amount of data that can be migrated in postcopy
> - * or in stopped state, i.e. after target start.
> - * Some can also be migrated during precopy (RAM).
> - * Some must be migrated after source stops
> - * (block-dirty-bitmap)
> + * @pending: pointer to a MigPendingData struct
> */
> - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> + void (*save_query_pending)(void *opaque, MigPendingData *pending);
>
> /**
> * @load_state
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b3d1e8a13c..b116933bce 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -14,6 +14,8 @@
> #ifndef MIGRATION_SAVEVM_H
> #define MIGRATION_SAVEVM_H
>
> +#include "migration/register.h"
> +
> #define QEMU_VM_FILE_MAGIC 0x5145564d
> #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> #define QEMU_VM_FILE_VERSION 0x00000003
> @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> void qemu_savevm_state_cleanup(void);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy);
> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> index d808ece3b9..b1ec51c77a 100644
> --- a/hw/s390x/s390-stattrib.c
> +++ b/hw/s390x/s390-stattrib.c
> @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> return 0;
> }
>
> -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void cmma_state_pending(void *opaque, MigPendingData *pending)
> {
> S390StAttribState *sas = S390_STATTRIB(opaque);
> S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> long long res = sac->get_dirtycount(sas);
>
> if (res >= 0) {
> - *must_precopy += res;
> + pending->precopy_bytes += res;
> }
> }
>
> @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> .save_setup = cmma_save_setup,
> .save_live_iterate = cmma_save_iterate,
> .save_complete = cmma_save_complete,
> - .state_pending_exact = cmma_state_pending,
> - .state_pending_estimate = cmma_state_pending,
> + .save_query_pending = cmma_state_pending,
> .save_cleanup = cmma_save_cleanup,
> .load_state = cmma_load,
> .is_active = cmma_active,
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 827d3ded63..c054c749b0 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
> trace_vfio_save_cleanup(vbasedev->name);
> }
>
> -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> {
> - VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
>
> - if (!vfio_device_state_is_precopy(vbasedev)) {
> - return;
> - }
> + vfio_query_stop_copy_size(vbasedev);
>
> - *must_precopy +=
> - migration->precopy_init_size + migration->precopy_dirty_size;
> + if (vfio_device_state_is_precopy(vbasedev)) {
> + vfio_query_precopy_size(migration);
> + }
>
> - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> - *can_postcopy,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + /*
> + * In all cases, all PRECOPY data should be no more than STOPCOPY data.
> + * Otherwise we have a problem. So far, only dump some errors.
> + */
> + if (migration->precopy_init_size + migration->precopy_dirty_size <
> + migration->stopcopy_size) {
> + error_report_once("%s: wrong pending data (init=%" PRIx64
> + ", dirty=%"PRIx64", stop=%"PRIx64")",
> + __func__, migration->precopy_init_size,
> + migration->precopy_dirty_size,
> + migration->stopcopy_size);
> + }
> }
>
> -static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void vfio_state_pending(void *opaque, MigPendingData *pending)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> + uint64_t remain;
>
> - vfio_query_stop_copy_size(vbasedev);
> - *must_precopy += migration->stopcopy_size;
> -
> - if (vfio_device_state_is_precopy(vbasedev)) {
> - vfio_query_precopy_size(migration);
> + if (pending->fastpath) {
> + if (!vfio_device_state_is_precopy(vbasedev)) {
> + return;
> + }
> + remain = migration->precopy_init_size + migration->precopy_dirty_size;
> + } else {
> + vfio_state_pending_sync(vbasedev);
> + remain = migration->stopcopy_size;
> }
>
> - trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> - migration->stopcopy_size,
> - migration->precopy_init_size,
> - migration->precopy_dirty_size);
> + pending->precopy_bytes += remain;
> +
> + trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> + migration->precopy_init_size,
> + migration->precopy_dirty_size);
> }
>
> static bool vfio_is_active_iterate(void *opaque)
> @@ -850,8 +859,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> .save_prepare = vfio_save_prepare,
> .save_setup = vfio_save_setup,
> .save_cleanup = vfio_save_cleanup,
> - .state_pending_estimate = vfio_state_pending_estimate,
> - .state_pending_exact = vfio_state_pending_exact,
> + .save_query_pending = vfio_state_pending,
> .is_active_iterate = vfio_is_active_iterate,
> .save_live_iterate = vfio_save_iterate,
> .save_complete = vfio_save_complete_precopy,
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index a061aad817..376a9b43ac 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -766,9 +766,7 @@ static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque)
> return 0;
> }
>
> -static void dirty_bitmap_state_pending(void *opaque,
> - uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void dirty_bitmap_state_pending(void *opaque, MigPendingData *data)
> {
> DBMSaveState *s = &((DBMState *)opaque)->save;
> SaveBitmapState *dbms;
> @@ -788,7 +786,7 @@ static void dirty_bitmap_state_pending(void *opaque,
>
> trace_dirty_bitmap_state_pending(pending);
>
> - *can_postcopy += pending;
> + data->postcopy_bytes += pending;
> }
>
> /* First occurrence of this bitmap. It should be created if doesn't exist */
> @@ -1250,8 +1248,7 @@ static SaveVMHandlers savevm_dirty_bitmap_handlers = {
> .save_setup = dirty_bitmap_save_setup,
> .save_complete = dirty_bitmap_save_complete,
> .has_postcopy = dirty_bitmap_has_postcopy,
> - .state_pending_exact = dirty_bitmap_state_pending,
> - .state_pending_estimate = dirty_bitmap_state_pending,
> + .save_query_pending = dirty_bitmap_state_pending,
> .save_live_iterate = dirty_bitmap_save_iterate,
> .is_active_iterate = dirty_bitmap_is_active_iterate,
> .load_state = dirty_bitmap_load,
> diff --git a/migration/ram.c b/migration/ram.c
> index 979751f61b..89f761a471 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3443,30 +3443,17 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> return qemu_fflush(f);
> }
>
> -static void ram_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - RAMState **temp = opaque;
> - RAMState *rs = *temp;
> -
> - uint64_t remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -
> - if (migrate_postcopy_ram()) {
> - /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> - } else {
> - *must_precopy += remaining_size;
> - }
> -}
> -
> -static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +static void ram_state_pending(void *opaque, MigPendingData *pending)
> {
> RAMState **temp = opaque;
> RAMState *rs = *temp;
> uint64_t remaining_size;
>
> - if (!migration_in_postcopy()) {
> + /*
> + * Sync can be skipped either with: (1) a fast query, or (2) postcopy
> + * already started (in which case no new dirty pages will be generated).
> + */
> + if (!pending->fastpath && !migration_in_postcopy()) {
> bql_lock();
> WITH_RCU_READ_LOCK_GUARD() {
> migration_bitmap_sync_precopy(false);
> @@ -3478,9 +3465,9 @@ static void ram_state_pending_exact(void *opaque, uint64_t *must_precopy,
>
> if (migrate_postcopy_ram()) {
> /* We can do postcopy, and all the data is postcopiable */
> - *can_postcopy += remaining_size;
> + pending->postcopy_bytes += remaining_size;
> } else {
> - *must_precopy += remaining_size;
> + pending->precopy_bytes += remaining_size;
> }
> }
>
> @@ -4703,8 +4690,7 @@ static SaveVMHandlers savevm_ram_handlers = {
> .save_live_iterate = ram_save_iterate,
> .save_complete = ram_save_complete,
> .has_postcopy = ram_has_postcopy,
> - .state_pending_exact = ram_state_pending_exact,
> - .state_pending_estimate = ram_state_pending_estimate,
> + .save_query_pending = ram_state_pending,
> .load_state = ram_load,
> .save_cleanup = ram_save_cleanup,
> .load_setup = ram_load_setup,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index dd58f2a705..6268e68382 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1762,46 +1762,45 @@ int qemu_savevm_state_complete_precopy(MigrationState *s)
> return qemu_fflush(f);
> }
>
> -/* Give an estimate of the amount left to be transferred,
> - * the result is split into the amount for units that can and
> - * for units that can't do postcopy.
> - */
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> {
> SaveStateEntry *se;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + pending->precopy_bytes = 0;
> + pending->postcopy_bytes = 0;
> + pending->fastpath = fastpath;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_estimate) {
> + if (!se->ops || !se->ops->save_query_pending) {
> continue;
> }
> if (!qemu_savevm_state_active(se)) {
> continue;
> }
> - se->ops->state_pending_estimate(se->opaque, must_precopy, can_postcopy);
> + se->ops->save_query_pending(se->opaque, pending);
> }
> }
>
> +void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> + uint64_t *can_postcopy)
> +{
> + MigPendingData pending;
> +
> + qemu_savevm_query_pending(&pending, true);
> +
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> +}
> +
> void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> - SaveStateEntry *se;
> + MigPendingData pending;
>
> - *must_precopy = 0;
> - *can_postcopy = 0;
> + qemu_savevm_query_pending(&pending, false);
>
> - QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> - if (!se->ops || !se->ops->state_pending_exact) {
> - continue;
> - }
> - if (!qemu_savevm_state_active(se)) {
> - continue;
> - }
> - se->ops->state_pending_exact(se->opaque, must_precopy, can_postcopy);
> - }
> + *must_precopy = pending.precopy_bytes;
> + *can_postcopy = pending.postcopy_bytes;
> }
>
> void qemu_savevm_state_cleanup(void)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 846e3625c5..7cf5a9eb2d 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -173,8 +173,7 @@ vfio_save_device_config_state(const char *name) " (%s)"
> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_save_iterate_start(const char *name) " (%s)"
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size %"PRIu64
> -vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy %"PRIu64" postcopy %"PRIu64" stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> +vfio_state_pending(const char *name, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) stopcopy size %"PRIu64" precopy initial size %"PRIu64" precopy dirty size %"PRIu64
> vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (4 preceding siblings ...)
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-24 9:35 ` Prasad Pandit
2026-03-27 15:24 ` Juraj Marcin
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
` (5 subsequent siblings)
11 siblings, 2 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
It's easier to use the new API directly in the migration iterations. This
also paves the way for follow-up patches to report new data directly to
the iterator function.
While at it, merge the two tracepoints into one.
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/savevm.h | 4 ----
migration/migration.c | 16 +++++++---------
migration/savevm.c | 23 ++---------------------
migration/trace-events | 1 +
4 files changed, 10 insertions(+), 34 deletions(-)
diff --git a/migration/savevm.h b/migration/savevm.h
index b116933bce..3c39190fa4 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -46,10 +46,6 @@ void qemu_savevm_state_cleanup(void);
void qemu_savevm_state_complete_postcopy(QEMUFile *f);
int qemu_savevm_state_complete_precopy(MigrationState *s);
void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
- uint64_t *can_postcopy);
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
- uint64_t *can_postcopy);
int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
void qemu_savevm_state_end(QEMUFile *f);
diff --git a/migration/migration.c b/migration/migration.c
index dfc60372cf..99c4d09000 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3204,17 +3204,17 @@ typedef enum {
*/
static MigIterateState migration_iteration_run(MigrationState *s)
{
- uint64_t must_precopy, can_postcopy, pending_size;
Error *local_err = NULL;
bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
bool can_switchover = migration_can_switchover(s);
+ MigPendingData pending = { };
+ uint64_t pending_size;
bool complete_ready;
/* Fast path - get the estimated amount of pending data */
- qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
- pending_size = must_precopy + can_postcopy;
- trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
+ qemu_savevm_query_pending(&pending, true);
+ pending_size = pending.precopy_bytes + pending.postcopy_bytes;
if (in_postcopy) {
/*
@@ -3243,14 +3243,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* during postcopy phase.
*/
if (pending_size <= s->threshold_size) {
- qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
- pending_size = must_precopy + can_postcopy;
- trace_migrate_pending_exact(pending_size, must_precopy,
- can_postcopy);
+ qemu_savevm_query_pending(&pending, false);
+ pending_size = pending.precopy_bytes + pending.postcopy_bytes;
}
/* Should we switch to postcopy now? */
- if (must_precopy <= s->threshold_size &&
+ if (pending.precopy_bytes <= s->threshold_size &&
can_switchover && qatomic_read(&s->start_postcopy)) {
if (postcopy_start(s, &local_err)) {
migrate_error_propagate(s, error_copy(local_err));
diff --git a/migration/savevm.c b/migration/savevm.c
index 6268e68382..b3285d480f 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1779,28 +1779,9 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
}
se->ops->save_query_pending(se->opaque, pending);
}
-}
-
-void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
- uint64_t *can_postcopy)
-{
- MigPendingData pending;
-
- qemu_savevm_query_pending(&pending, true);
-
- *must_precopy = pending.precopy_bytes;
- *can_postcopy = pending.postcopy_bytes;
-}
-
-void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
- uint64_t *can_postcopy)
-{
- MigPendingData pending;
-
- qemu_savevm_query_pending(&pending, false);
- *must_precopy = pending.precopy_bytes;
- *can_postcopy = pending.postcopy_bytes;
+ trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
+ pending->postcopy_bytes);
}
void qemu_savevm_state_cleanup(void)
diff --git a/migration/trace-events b/migration/trace-events
index 60e5087e38..5f836a8652 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
+qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly
2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
@ 2026-03-24 9:35 ` Prasad Pandit
2026-03-27 15:24 ` Juraj Marcin
1 sibling, 0 replies; 67+ messages in thread
From: Prasad Pandit @ 2026-03-24 9:35 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> It's easier to use the new API directly in the migration iterations. This
> also paves way for follow up patches to add new data to report directly to
> the iterator function.
>
> When at it, merge the tracepoints too into one.
>
> No functional change intended.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/savevm.h | 4 ----
> migration/migration.c | 16 +++++++---------
> migration/savevm.c | 23 ++---------------------
> migration/trace-events | 1 +
> 4 files changed, 10 insertions(+), 34 deletions(-)
>
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b116933bce..3c39190fa4 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -46,10 +46,6 @@ void qemu_savevm_state_cleanup(void);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> -void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy);
* +1.
> int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
> bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
> void qemu_savevm_state_end(QEMUFile *f);
> diff --git a/migration/migration.c b/migration/migration.c
> index dfc60372cf..99c4d09000 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3204,17 +3204,17 @@ typedef enum {
> */
> static MigIterateState migration_iteration_run(MigrationState *s)
> {
> - uint64_t must_precopy, can_postcopy, pending_size;
> Error *local_err = NULL;
> bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> bool can_switchover = migration_can_switchover(s);
> + MigPendingData pending = { };
> + uint64_t pending_size;
> bool complete_ready;
>
> /* Fast path - get the estimated amount of pending data */
> - qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
> - pending_size = must_precopy + can_postcopy;
> - trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
> + qemu_savevm_query_pending(&pending, true);
> + pending_size = pending.precopy_bytes + pending.postcopy_bytes;
* Why do we have 'precopy_bytes' & 'postcopy_bytes'? What if
qemu_savevm_query_pending() returned the pending data size?
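One reason for keeping the two fields separate is visible a few hunks down in this same patch: the sum of both gates convergence, while `precopy_bytes` alone gates the switch to postcopy, since postcopiable data can still be shipped after the target starts. A simplified, self-contained sketch of that decision follows; `demo_decide` and `DemoAction` are hypothetical names, and the logic only mirrors the shape of `migration_iteration_run()`, not its full behavior.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum {
    KEEP_ITERATING,  /* too much pending data, run another iteration */
    START_POSTCOPY,  /* precopy share is small enough to switch over */
    COMPLETE,        /* everything fits in the downtime budget */
} DemoAction;

/*
 * Simplified convergence decision: only precopy_bytes must fit under the
 * threshold to start postcopy, but the full sum must fit to complete a
 * plain precopy migration.
 */
DemoAction demo_decide(uint64_t precopy_bytes, uint64_t postcopy_bytes,
                       uint64_t threshold, bool want_postcopy)
{
    uint64_t pending_size = precopy_bytes + postcopy_bytes;

    if (want_postcopy && precopy_bytes <= threshold) {
        return START_POSTCOPY;
    }
    if (pending_size <= threshold) {
        return COMPLETE;
    }
    return KEEP_ITERATING;
}
```

With a single combined size, the caller could no longer tell "1 MB that must move before switchover" apart from "1 GB that can move afterwards", which matters exactly in the postcopy case above.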
> if (in_postcopy) {
> /*
> @@ -3243,14 +3243,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * during postcopy phase.
> */
> if (pending_size <= s->threshold_size) {
> - qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
> - pending_size = must_precopy + can_postcopy;
> - trace_migrate_pending_exact(pending_size, must_precopy,
> - can_postcopy);
> + qemu_savevm_query_pending(&pending, false);
> + pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> }
>
> /* Should we switch to postcopy now? */
> - if (must_precopy <= s->threshold_size &&
> + if (pending.precopy_bytes <= s->threshold_size &&
> can_switchover && qatomic_read(&s->start_postcopy)) {
> if (postcopy_start(s, &local_err)) {
> migrate_error_propagate(s, error_copy(local_err));
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 6268e68382..b3285d480f 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1779,28 +1779,9 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> }
> se->ops->save_query_pending(se->opaque, pending);
> }
> -}
> -
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - MigPendingData pending;
> -
> - qemu_savevm_query_pending(&pending, true);
> -
> - *must_precopy = pending.precopy_bytes;
> - *can_postcopy = pending.postcopy_bytes;
> -}
> -
> -void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - MigPendingData pending;
> -
> - qemu_savevm_query_pending(&pending, false);
>
> - *must_precopy = pending.precopy_bytes;
> - *can_postcopy = pending.postcopy_bytes;
> + trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> + pending->postcopy_bytes);
> }
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly
2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
2026-03-24 9:35 ` Prasad Pandit
@ 2026-03-27 15:24 ` Juraj Marcin
2026-04-01 22:28 ` Peter Xu
1 sibling, 1 reply; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 15:24 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On 2026-03-19 19:12, Peter Xu wrote:
> It's easier to use the new API directly in the migration iterations. This
> also paves way for follow up patches to add new data to report directly to
> the iterator function.
>
> When at it, merge the tracepoints too into one.
>
> No functional change intended.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/savevm.h | 4 ----
> migration/migration.c | 16 +++++++---------
> migration/savevm.c | 23 ++---------------------
> migration/trace-events | 1 +
> 4 files changed, 10 insertions(+), 34 deletions(-)
>
> diff --git a/migration/savevm.h b/migration/savevm.h
> index b116933bce..3c39190fa4 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -46,10 +46,6 @@ void qemu_savevm_state_cleanup(void);
> void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> int qemu_savevm_state_complete_precopy(MigrationState *s);
> void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> -void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy);
> int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy);
> bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp);
> void qemu_savevm_state_end(QEMUFile *f);
> diff --git a/migration/migration.c b/migration/migration.c
> index dfc60372cf..99c4d09000 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3204,17 +3204,17 @@ typedef enum {
> */
> static MigIterateState migration_iteration_run(MigrationState *s)
> {
> - uint64_t must_precopy, can_postcopy, pending_size;
> Error *local_err = NULL;
> bool in_postcopy = (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE ||
> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> bool can_switchover = migration_can_switchover(s);
> + MigPendingData pending = { };
> + uint64_t pending_size;
> bool complete_ready;
>
> /* Fast path - get the estimated amount of pending data */
> - qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
> - pending_size = must_precopy + can_postcopy;
> - trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
> + qemu_savevm_query_pending(&pending, true);
> + pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
> if (in_postcopy) {
> /*
> @@ -3243,14 +3243,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * during postcopy phase.
> */
> if (pending_size <= s->threshold_size) {
> - qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
> - pending_size = must_precopy + can_postcopy;
> - trace_migrate_pending_exact(pending_size, must_precopy,
> - can_postcopy);
> + qemu_savevm_query_pending(&pending, false);
> + pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> }
>
> /* Should we switch to postcopy now? */
> - if (must_precopy <= s->threshold_size &&
> + if (pending.precopy_bytes <= s->threshold_size &&
> can_switchover && qatomic_read(&s->start_postcopy)) {
> if (postcopy_start(s, &local_err)) {
> migrate_error_propagate(s, error_copy(local_err));
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 6268e68382..b3285d480f 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1779,28 +1779,9 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> }
> se->ops->save_query_pending(se->opaque, pending);
> }
> -}
> -
> -void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - MigPendingData pending;
> -
> - qemu_savevm_query_pending(&pending, true);
> -
> - *must_precopy = pending.precopy_bytes;
> - *can_postcopy = pending.postcopy_bytes;
> -}
> -
> -void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> - uint64_t *can_postcopy)
> -{
> - MigPendingData pending;
> -
> - qemu_savevm_query_pending(&pending, false);
>
> - *must_precopy = pending.precopy_bytes;
> - *can_postcopy = pending.postcopy_bytes;
> + trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> + pending->postcopy_bytes);
> }
>
> void qemu_savevm_state_cleanup(void)
> diff --git a/migration/trace-events b/migration/trace-events
> index 60e5087e38..5f836a8652 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> +qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
Should probably remove these two trace events, or is keeping them
intentional?
migrate_pending_exact(uint64_t size, uint64_t pre, uint64_t post) "exact pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
migrate_pending_estimate(uint64_t size, uint64_t pre, uint64_t post) "estimate pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
> loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> --
> 2.50.1
>
With unused trace events comment resolved,
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly
2026-03-27 15:24 ` Juraj Marcin
@ 2026-04-01 22:28 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-01 22:28 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, Mar 27, 2026 at 04:24:09PM +0100, Juraj Marcin wrote:
> > diff --git a/migration/trace-events b/migration/trace-events
> > index 60e5087e38..5f836a8652 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> > qemu_loadvm_state_post_main(int ret) "%d"
> > qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> > qemu_savevm_send_packaged(void) ""
> > +qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
>
> Should probably remove these two trace events, or is keeping them
> intentional?
>
> migrate_pending_exact(uint64_t size, uint64_t pre, uint64_t post) "exact pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
> migrate_pending_estimate(uint64_t size, uint64_t pre, uint64_t post) "estimate pending size %" PRIu64 " (pre = %" PRIu64 " post=%" PRIu64 ")"
Nah, I forgot about them.. I'll remove them.
>
> > loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> > loadvm_state_setup(void) ""
> > loadvm_state_cleanup(void) ""
> > --
> > 2.50.1
> >
>
> With unused trace events comment resolved,
>
> Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (5 preceding siblings ...)
2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-24 11:05 ` Prasad Pandit
` (2 more replies)
2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
` (4 subsequent siblings)
11 siblings, 3 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
Allow modules to report data that can only be migrated after the VM is
stopped.
Once this concept is introduced, the stopcopy size needs to be accounted
as part of pending_size, just as before.
One thing to mention: once there is a stopcopy size, the old
"pending_size" may never drop low enough to kick off the slow version of
the query sync. Previously that was almost guaranteed to happen, because
with continued iteration pending_size can normally reach zero in a
precopy-only setup, where everything reported is assumed migratable during
the precopy phase.
So we also need to make sure QEMU kicks off a synchronized pending query
once all precopy data has been migrated. This can be important for VFIO,
to keep making progress even when the downtime requirement cannot yet be
satisfied.
So far, this patch should introduce no functional change, as no module
reports a stopcopy size yet.
This paves the way for VFIO to properly report its pending data sizes,
which is actually buggy today; that will be done in follow-up patches.
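For illustration, the aggregation and the new convergence rule described
above can be sketched as a standalone mock (plain C, not QEMU code; the
struct layout, the 4 KiB figure, and the helper names merely mirror the
patch for readability):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Standalone mirror of MigPendingData, for illustration only. */
typedef struct {
    uint64_t precopy_bytes;   /* movable while the VM still runs */
    uint64_t postcopy_bytes;  /* movable after switching to postcopy */
    uint64_t stopcopy_bytes;  /* movable only once the VM is stopped */
    bool fastpath;            /* estimate-only query? */
    uint64_t total_bytes;     /* filled by the core, never by modules */
} MigPendingData;

/* Mock module handler: precopy data is fully drained, but 4 KiB of
 * device state can only move once the VM is stopped. */
static void mock_module_query(MigPendingData *p)
{
    p->stopcopy_bytes += 4096;
}

/* Core-side aggregation, mirroring qemu_savevm_query_pending(). */
static void query_pending(MigPendingData *p, bool fastpath)
{
    memset(p, 0, sizeof(*p));
    p->fastpath = fastpath;
    mock_module_query(p);
    p->total_bytes = p->precopy_bytes + p->stopcopy_bytes +
                     p->postcopy_bytes;
}

/* The rule this patch adds: do the slow re-query either when the estimate
 * already fits the threshold, or when no precopy data is left, so that a
 * large stopcopy size cannot stall the iteration loop forever. */
static bool next_ready(uint64_t threshold, const MigPendingData *p)
{
    return p->total_bytes <= threshold || p->precopy_bytes == 0;
}
```

With this mock, a 4 KiB stopcopy-only load never fits a 100-byte
threshold, yet next_ready() still returns true because precopy is drained.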
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/migration/register.h | 12 +++++++++
migration/migration.c | 52 ++++++++++++++++++++++++++++++------
migration/savevm.c | 7 +++--
migration/trace-events | 2 +-
4 files changed, 62 insertions(+), 11 deletions(-)
diff --git a/include/migration/register.h b/include/migration/register.h
index 2320c3a981..3824958ba5 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -17,12 +17,24 @@
#include "hw/core/vmstate-if.h"
typedef struct MigPendingData {
+ /*
+ * Modules can only update these fields in a query request via its
+ * save_query_pending() API.
+ */
/* How many bytes are pending for precopy / stopcopy? */
uint64_t precopy_bytes;
/* How many bytes are pending that can be transferred in postcopy? */
uint64_t postcopy_bytes;
+ /* How many bytes that can only be transferred when VM stopped? */
+ uint64_t stopcopy_bytes;
+
+ /*
+ * Modules should never update these fields.
+ */
/* Is this a fastpath query (which can be inaccurate)? */
bool fastpath;
+ /* Total pending data */
+ uint64_t total_bytes;
} MigPendingData ;
/**
diff --git a/migration/migration.c b/migration/migration.c
index 99c4d09000..42facb16d1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3198,6 +3198,44 @@ typedef enum {
MIG_ITERATE_BREAK, /* Break the loop */
} MigIterateState;
+/* Are we ready to move to the next iteration phase? */
+static bool migration_iteration_next_ready(MigrationState *s,
+ MigPendingData *pending)
+{
+ /*
+ * If the estimated values already suggest us to switchover, mark this
+ * iteration finished, time to do a slow sync.
+ */
+ if (pending->total_bytes <= s->threshold_size) {
+ return true;
+ }
+
+ /*
+ * Since we may have modules reporting stop-only data, we also want to
+ * re-query with slow mode if all precopy data is moved over. This
+ * will also mark the current iteration done.
+ *
+ * This could happen when e.g. a module (like, VFIO) reports stopcopy
+ * size too large so it will never yet satisfy the downtime with the
+ * current setup (above check). Here, slow version of re-query helps
+ * because we keep trying the best to move whatever we have.
+ */
+ if (pending->precopy_bytes == 0) {
+ return true;
+ }
+
+ return false;
+}
+
+static void migration_iteration_go_next(MigPendingData *pending)
+{
+ /*
+ * Do a slow sync will achieve this. TODO: move RAM iteration code
+ * into the core layer.
+ */
+ qemu_savevm_query_pending(pending, false);
+}
+
/*
* Return true if continue to the next iteration directly, false
* otherwise.
@@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
bool can_switchover = migration_can_switchover(s);
MigPendingData pending = { };
- uint64_t pending_size;
bool complete_ready;
/* Fast path - get the estimated amount of pending data */
qemu_savevm_query_pending(&pending, true);
- pending_size = pending.precopy_bytes + pending.postcopy_bytes;
if (in_postcopy) {
/*
@@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* postcopy completion doesn't rely on can_switchover, because when
* POSTCOPY_ACTIVE it means switchover already happened.
*/
- complete_ready = !pending_size;
+ complete_ready = !pending.total_bytes;
if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
(s->postcopy_package_loaded || complete_ready)) {
/*
@@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* postcopy started, so ESTIMATE should always match with EXACT
* during postcopy phase.
*/
- if (pending_size <= s->threshold_size) {
- qemu_savevm_query_pending(&pending, false);
- pending_size = pending.precopy_bytes + pending.postcopy_bytes;
+ if (migration_iteration_next_ready(s, &pending)) {
+ migration_iteration_go_next(&pending);
}
/* Should we switch to postcopy now? */
@@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
* (2) Pending size is no more than the threshold specified
* (which was calculated from expected downtime)
*/
- complete_ready = can_switchover && (pending_size <= s->threshold_size);
+ complete_ready = can_switchover &&
+ (pending.total_bytes <= s->threshold_size);
}
if (complete_ready) {
- trace_migration_thread_low_pending(pending_size);
+ trace_migration_thread_low_pending(pending.total_bytes);
migration_completion(s);
return MIG_ITERATE_BREAK;
}
diff --git a/migration/savevm.c b/migration/savevm.c
index b3285d480f..812c72b3e5 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
{
SaveStateEntry *se;
- pending->precopy_bytes = 0;
- pending->postcopy_bytes = 0;
+ memset(pending, 0, sizeof(*pending));
pending->fastpath = fastpath;
QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
@@ -1780,7 +1779,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
se->ops->save_query_pending(se->opaque, pending);
}
+ pending->total_bytes = pending->precopy_bytes +
+ pending->stopcopy_bytes + pending->postcopy_bytes;
+
trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
+ pending->stopcopy_bytes,
pending->postcopy_bytes);
}
diff --git a/migration/trace-events b/migration/trace-events
index 5f836a8652..175f09f8ad 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
-qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
+qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
@ 2026-03-24 11:05 ` Prasad Pandit
2026-03-25 16:54 ` Avihai Horon
2026-03-27 16:43 ` Juraj Marcin
2 siblings, 0 replies; 67+ messages in thread
From: Prasad Pandit @ 2026-03-24 11:05 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> Allow modules to report data that can only be migrated after the VM is
> stopped.
>
> Once this concept is introduced, the stopcopy size needs to be accounted
> as part of pending_size, just as before.
>
> One thing to mention: once there is a stopcopy size, the old
> "pending_size" may never drop low enough to kick off the slow version of
> the query sync. Previously that was almost guaranteed to happen, because
> with continued iteration pending_size can normally reach zero in a
> precopy-only setup, where everything reported is assumed migratable
> during the precopy phase.
>
> So we also need to make sure QEMU kicks off a synchronized pending query
> once all precopy data has been migrated. This can be important for VFIO,
> to keep making progress even when the downtime requirement cannot yet be
> satisfied.
>
> So far, this patch should introduce no functional change, as no module
> reports a stopcopy size yet.
>
> This paves the way for VFIO to properly report its pending data sizes,
> which is actually buggy today; that will be done in follow-up patches.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/register.h | 12 +++++++++
> migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> migration/savevm.c | 7 +++--
> migration/trace-events | 2 +-
> 4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 2320c3a981..3824958ba5 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -17,12 +17,24 @@
> #include "hw/core/vmstate-if.h"
>
> typedef struct MigPendingData {
> + /*
> + * Modules can only update these fields in a query request via its
> + * save_query_pending() API.
> + */
> /* How many bytes are pending for precopy / stopcopy? */
> uint64_t precopy_bytes;
> /* How many bytes are pending that can be transferred in postcopy? */
> uint64_t postcopy_bytes;
> + /* How many bytes that can only be transferred when VM stopped? */
> + uint64_t stopcopy_bytes;
* This differentiation of pending bytes into
precopy/postcopy/stopcopy/total could become confusing, because the
intention of each field isn't readily clear. "Pending bytes" suggests
bytes still pending, i.e. waiting to be sent, which is not quite the case
for postcopy_bytes, stopcopy_bytes, and total_bytes. Do we really need
this separation?
> + /*
> + * Modules should never update these fields.
> + */
> /* Is this a fastpath query (which can be inaccurate)? */
> bool fastpath;
* The "fast" and "slow" adjectives generally go with speed rather than
accuracy or rough estimates. Here the flag is used more to decide whether
to return an estimated value or an accurate one. It would help to rename
it to something that conveys accuracy/precision.
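A toy mock (again not QEMU code; the cached counter standing in for the
dirty bitmap is invented here for illustration) shows where the fast/slow
naming comes from: the fast path is cheap because it returns a possibly
stale cached value, and accuracy is a consequence of paying for the sync:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cheap cached counter; may be stale until a sync refreshes it. */
static uint64_t dirty_estimate = 1024 * 1024;

/* Slow path: pretend to walk the dirty bitmap, which both refreshes
 * the cache and returns the accurate value. */
static uint64_t sync_dirty_bitmap(void)
{
    dirty_estimate = 512 * 1024;
    return dirty_estimate;
}

/* fastpath == true returns the estimate without paying for a sync. */
static uint64_t query_pending_bytes(bool fastpath)
{
    return fastpath ? dirty_estimate : sync_dirty_bitmap();
}
```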
> + /* Total pending data */
> + uint64_t total_bytes;
> } MigPendingData ;
>
> /**
> diff --git a/migration/migration.c b/migration/migration.c
> index 99c4d09000..42facb16d1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3198,6 +3198,44 @@ typedef enum {
> MIG_ITERATE_BREAK, /* Break the loop */
> } MigIterateState;
>
> +/* Are we ready to move to the next iteration phase? */
> +static bool migration_iteration_next_ready(MigrationState *s,
> + MigPendingData *pending)
> +{
> + /*
> + * If the estimated values already suggest us to switchover, mark this
> + * iteration finished, time to do a slow sync.
> + */
> + if (pending->total_bytes <= s->threshold_size) {
> + return true;
> + }
> +
> + /*
> + * Since we may have modules reporting stop-only data, we also want to
> + * re-query with slow mode if all precopy data is moved over. This
> + * will also mark the current iteration done.
> + *
> + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> + * size too large so it will never yet satisfy the downtime with the
> + * current setup (above check). Here, slow version of re-query helps
> + * because we keep trying the best to move whatever we have.
> + */
> + if (pending->precopy_bytes == 0) {
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void migration_iteration_go_next(MigPendingData *pending)
> +{
> + /*
> + * Do a slow sync will achieve this. TODO: move RAM iteration code
> + * into the core layer.
> + */
> + qemu_savevm_query_pending(pending, false);
> +}
* What is the _go_next() function used for? To trigger a
migration_bitmap_sync_precopy() call via qemu_savevm_query_pending(...,
false)?
> /*
> * Return true if continue to the next iteration directly, false
> * otherwise.
> @@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> bool can_switchover = migration_can_switchover(s);
> MigPendingData pending = { };
> - uint64_t pending_size;
> bool complete_ready;
>
> /* Fast path - get the estimated amount of pending data */
> qemu_savevm_query_pending(&pending, true);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
> if (in_postcopy) {
> /*
> @@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy completion doesn't rely on can_switchover, because when
> * POSTCOPY_ACTIVE it means switchover already happened.
> */
> - complete_ready = !pending_size;
> + complete_ready = !pending.total_bytes;
> if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
> (s->postcopy_package_loaded || complete_ready)) {
> /*
> @@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy started, so ESTIMATE should always match with EXACT
> * during postcopy phase.
> */
> - if (pending_size <= s->threshold_size) {
> - qemu_savevm_query_pending(&pending, false);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> + if (migration_iteration_next_ready(s, &pending)) {
> + migration_iteration_go_next(&pending);
> }
>
> /* Should we switch to postcopy now? */
> @@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * (2) Pending size is no more than the threshold specified
> * (which was calculated from expected downtime)
> */
> - complete_ready = can_switchover && (pending_size <= s->threshold_size);
> + complete_ready = can_switchover &&
> + (pending.total_bytes <= s->threshold_size);
> }
>
> if (complete_ready) {
> - trace_migration_thread_low_pending(pending_size);
> + trace_migration_thread_low_pending(pending.total_bytes);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b3285d480f..812c72b3e5 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> {
> SaveStateEntry *se;
>
> - pending->precopy_bytes = 0;
> - pending->postcopy_bytes = 0;
> + memset(pending, 0, sizeof(*pending));
> pending->fastpath = fastpath;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> @@ -1780,7 +1779,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> se->ops->save_query_pending(se->opaque, pending);
> }
>
> + pending->total_bytes = pending->precopy_bytes +
> + pending->stopcopy_bytes + pending->postcopy_bytes;
> +
> trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> + pending->stopcopy_bytes,
> pending->postcopy_bytes);
> }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index 5f836a8652..175f09f8ad 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
> loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> --
> 2.50.1
>
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
2026-03-24 11:05 ` Prasad Pandit
@ 2026-03-25 16:54 ` Avihai Horon
2026-04-02 14:09 ` Peter Xu
2026-03-27 16:43 ` Juraj Marcin
2 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-03-25 16:54 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On 3/20/2026 1:12 AM, Peter Xu wrote:
> Allow modules to report data that can only be migrated after the VM is
> stopped.
>
> Once this concept is introduced, the stopcopy size needs to be accounted
> as part of pending_size, just as before.
>
> One thing to mention: once there is a stopcopy size, the old
> "pending_size" may never drop low enough to kick off the slow version of
> the query sync. Previously that was almost guaranteed to happen, because
> with continued iteration pending_size can normally reach zero in a
> precopy-only setup, where everything reported is assumed migratable
> during the precopy phase.
>
> So we also need to make sure QEMU kicks off a synchronized pending query
> once all precopy data has been migrated. This can be important for VFIO,
> to keep making progress even when the downtime requirement cannot yet be
> satisfied.
>
> So far, this patch should introduce no functional change, as no module
> reports a stopcopy size yet.
>
> This paves the way for VFIO to properly report its pending data sizes,
> which is actually buggy today; that will be done in follow-up patches.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/register.h | 12 +++++++++
> migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> migration/savevm.c | 7 +++--
> migration/trace-events | 2 +-
> 4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 2320c3a981..3824958ba5 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -17,12 +17,24 @@
> #include "hw/core/vmstate-if.h"
>
> typedef struct MigPendingData {
> + /*
> + * Modules can only update these fields in a query request via its
> + * save_query_pending() API.
> + */
Move comment to patch #5?
> /* How many bytes are pending for precopy / stopcopy? */
> uint64_t precopy_bytes;
> /* How many bytes are pending that can be transferred in postcopy? */
> uint64_t postcopy_bytes;
> + /* How many bytes that can only be transferred when VM stopped? */
> + uint64_t stopcopy_bytes;
Keep consistent phrasing?
/* Amount of pending bytes that can be transferred either in precopy or
stopcopy */
uint64_t precopy_bytes;
/* Amount of pending bytes that can be transferred in postcopy */
uint64_t postcopy_bytes;
/* Amount of pending bytes that can be transferred only in stopcopy */
uint64_t stopcopy_bytes;
> +
> + /*
> + * Modules should never update these fields.
> + */
Move comment to patch #5?
> /* Is this a fastpath query (which can be inaccurate)? */
> bool fastpath;
> + /* Total pending data */
> + uint64_t total_bytes;
> } MigPendingData ;
>
> /**
> diff --git a/migration/migration.c b/migration/migration.c
> index 99c4d09000..42facb16d1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3198,6 +3198,44 @@ typedef enum {
> MIG_ITERATE_BREAK, /* Break the loop */
> } MigIterateState;
>
> +/* Are we ready to move to the next iteration phase? */
> +static bool migration_iteration_next_ready(MigrationState *s,
> + MigPendingData *pending)
> +{
> + /*
> + * If the estimated values already suggest us to switchover, mark this
> + * iteration finished, time to do a slow sync.
> + */
> + if (pending->total_bytes <= s->threshold_size) {
> + return true;
> + }
> +
> + /*
> + * Since we may have modules reporting stop-only data, we also want to
> + * re-query with slow mode if all precopy data is moved over. This
> + * will also mark the current iteration done.
> + *
> + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> + * size too large so it will never yet satisfy the downtime with the
> + * current setup (above check). Here, slow version of re-query helps
> + * because we keep trying the best to move whatever we have.
> + */
> + if (pending->precopy_bytes == 0) {
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void migration_iteration_go_next(MigPendingData *pending)
> +{
> + /*
> + * Do a slow sync will achieve this. TODO: move RAM iteration code
> + * into the core layer.
> + */
> + qemu_savevm_query_pending(pending, false);
> +}
I think the iteration terminology here could be confusing, because these
two functions are called from migration_iteration_run() and they don't
refer to the same iteration concept.
How about migration_pass_next_ready/migration_pass_go_next?
Or migration_dirty_sync_ready/migration_dirty_sync?
Thanks.
> +
> /*
> * Return true if continue to the next iteration directly, false
> * otherwise.
> @@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> bool can_switchover = migration_can_switchover(s);
> MigPendingData pending = { };
> - uint64_t pending_size;
> bool complete_ready;
>
> /* Fast path - get the estimated amount of pending data */
> qemu_savevm_query_pending(&pending, true);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
> if (in_postcopy) {
> /*
> @@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy completion doesn't rely on can_switchover, because when
> * POSTCOPY_ACTIVE it means switchover already happened.
> */
> - complete_ready = !pending_size;
> + complete_ready = !pending.total_bytes;
> if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
> (s->postcopy_package_loaded || complete_ready)) {
> /*
> @@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy started, so ESTIMATE should always match with EXACT
> * during postcopy phase.
> */
> - if (pending_size <= s->threshold_size) {
> - qemu_savevm_query_pending(&pending, false);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> + if (migration_iteration_next_ready(s, &pending)) {
> + migration_iteration_go_next(&pending);
> }
>
> /* Should we switch to postcopy now? */
> @@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * (2) Pending size is no more than the threshold specified
> * (which was calculated from expected downtime)
> */
> - complete_ready = can_switchover && (pending_size <= s->threshold_size);
> + complete_ready = can_switchover &&
> + (pending.total_bytes <= s->threshold_size);
> }
>
> if (complete_ready) {
> - trace_migration_thread_low_pending(pending_size);
> + trace_migration_thread_low_pending(pending.total_bytes);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b3285d480f..812c72b3e5 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> {
> SaveStateEntry *se;
>
> - pending->precopy_bytes = 0;
> - pending->postcopy_bytes = 0;
> + memset(pending, 0, sizeof(*pending));
> pending->fastpath = fastpath;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> @@ -1780,7 +1779,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> se->ops->save_query_pending(se->opaque, pending);
> }
>
> + pending->total_bytes = pending->precopy_bytes +
> + pending->stopcopy_bytes + pending->postcopy_bytes;
> +
> trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> + pending->stopcopy_bytes,
> pending->postcopy_bytes);
> }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index 5f836a8652..175f09f8ad 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
> loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-03-25 16:54 ` Avihai Horon
@ 2026-04-02 14:09 ` Peter Xu
2026-04-06 12:20 ` Avihai Horon
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-02 14:09 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Wed, Mar 25, 2026 at 06:54:58PM +0200, Avihai Horon wrote:
>
> On 3/20/2026 1:12 AM, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > Allow modules to report data that can only be migrated after VM is stopped.
> >
> > When this concept is introduced, we will need to account stopcopy size to
> > be part of pending_size as before.
> >
> > One thing to mention is, when there can be stopcopy size, it means the old
> > "pending_size" may not always be able to reach low enough to kickoff an
> > slow version of query sync. While it used to be almost guaranteed to
> > happen because if we keep iterating, normally pending_size can go to zero
> > for precopy-only because we assume everything reported can be migrated in
> > precopy phase.
> >
> > So we need to make sure QEMU will kickoff a synchronized version of query
> > pending when all precopy data is migrated too. This might be important to
> > VFIO to keep making progress even if the downtime cannot yet be satisfied.
> >
> > So far, this patch should introduce no functional change, as no module yet
> > report stopcopy size.
> >
> > This will pave way for VFIO to properly report its pending data sizes,
> > which was actually buggy today. Will be done in follow up patches.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/migration/register.h | 12 +++++++++
> > migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> > migration/savevm.c | 7 +++--
> > migration/trace-events | 2 +-
> > 4 files changed, 62 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/migration/register.h b/include/migration/register.h
> > index 2320c3a981..3824958ba5 100644
> > --- a/include/migration/register.h
> > +++ b/include/migration/register.h
> > @@ -17,12 +17,24 @@
> > #include "hw/core/vmstate-if.h"
> >
> > typedef struct MigPendingData {
> > + /*
> > + * Modules can only update these fields in a query request via its
> > + * save_query_pending() API.
> > + */
>
> Move comment to patch #5?
>
> > /* How many bytes are pending for precopy / stopcopy? */
> > uint64_t precopy_bytes;
> > /* How many bytes are pending that can be transferred in postcopy? */
> > uint64_t postcopy_bytes;
> > + /* How many bytes that can only be transferred when VM stopped? */
> > + uint64_t stopcopy_bytes;
>
> Keep consistent phrasing?
>
> /* Amount of pending bytes that can be transferred either in precopy or
> stopcopy */
> uint64_t precopy_bytes;
> /* Amount of pending bytes that can be transferred in postcopy */
> uint64_t postcopy_bytes;
> /* Amount of pending bytes that can be transferred only in stopcopy */
> uint64_t stopcopy_bytes;
>
> > +
> > + /*
> > + * Modules should never update these fields.
> > + */
>
> Move comment to patch #5?
Sure, I'll address all above.
>
> > /* Is this a fastpath query (which can be inaccurate)? */
> > bool fastpath;
> > + /* Total pending data */
> > + uint64_t total_bytes;
> > } MigPendingData ;
> >
> > /**
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 99c4d09000..42facb16d1 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3198,6 +3198,44 @@ typedef enum {
> > MIG_ITERATE_BREAK, /* Break the loop */
> > } MigIterateState;
> >
> > +/* Are we ready to move to the next iteration phase? */
> > +static bool migration_iteration_next_ready(MigrationState *s,
> > + MigPendingData *pending)
> > +{
> > + /*
> > + * If the estimated values already suggest us to switchover, mark this
> > + * iteration finished, time to do a slow sync.
> > + */
> > + if (pending->total_bytes <= s->threshold_size) {
> > + return true;
> > + }
> > +
> > + /*
> > + * Since we may have modules reporting stop-only data, we also want to
> > + * re-query with slow mode if all precopy data is moved over. This
> > + * will also mark the current iteration done.
> > + *
> > + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> > + * size too large so it will never yet satisfy the downtime with the
> > + * current setup (above check). Here, slow version of re-query helps
> > + * because we keep trying the best to move whatever we have.
> > + */
> > + if (pending->precopy_bytes == 0) {
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +static void migration_iteration_go_next(MigPendingData *pending)
> > +{
> > + /*
> > + * Do a slow sync will achieve this. TODO: move RAM iteration code
> > + * into the core layer.
> > + */
> > + qemu_savevm_query_pending(pending, false);
> > +}
>
> I think the iteration terminology here could be confusing, because these two
> functions are called from migration_iteration_run() and they don't refer to
> the same iteration concept.
> How about migration_pass_next_ready/migration_pass_go_next?
> Or migration_dirty_sync_ready/migration_dirty_sync?
I get the confusion, but IIUC the word "iteration" has normally been used
to describe one full walk of system resources; introducing "pass" may
cause another kind of confusion, at least to me.
E.g. see Dan's post from 2016, where iteration has been defined as such
ever since:
https://www.berrange.com/posts/2016/05/12/analysis-of-techniques-for-ensuring-migration-completion-with-kvm/
Here, migration_iteration_run() was named to say "run the current
iteration", but it doesn't necessarily mean it can only be run once for the
current iteration.
OTOH, I don't think it makes much sense to define "iteration" (or any new
term at all) as one run of migration_iteration_run(): currently, we
invoke the handlers' savevm iterator hook for each
migration_iteration_run(), but that's pretty arbitrary behavior..
For RAM, that's MAX_WAIT. For VFIO, it's something else...
It is still a "grey area" with no clear definition, even if we may want
to define it at some point to make each iterator follow some rule.
Compared to that, defining "iteration" as one round of walking all VM
resources makes a lot of sense, and it's a critical concept for precopy.
Thanks,
--
Peter Xu
^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-02 14:09 ` Peter Xu
@ 2026-04-06 12:20 ` Avihai Horon
2026-04-07 15:30 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-04-06 12:20 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On 4/2/2026 5:09 PM, Peter Xu wrote:
> On Wed, Mar 25, 2026 at 06:54:58PM +0200, Avihai Horon wrote:
>> On 3/20/2026 1:12 AM, Peter Xu wrote:
>>> Allow modules to report data that can only be migrated after VM is stopped.
>>>
>>> When this concept is introduced, we will need to account stopcopy size to
>>> be part of pending_size as before.
>>>
>>> One thing to mention is, when there can be stopcopy size, it means the old
>>> "pending_size" may not always be able to reach low enough to kickoff an
>>> slow version of query sync. While it used to be almost guaranteed to
>>> happen because if we keep iterating, normally pending_size can go to zero
>>> for precopy-only because we assume everything reported can be migrated in
>>> precopy phase.
>>>
>>> So we need to make sure QEMU will kickoff a synchronized version of query
>>> pending when all precopy data is migrated too. This might be important to
>>> VFIO to keep making progress even if the downtime cannot yet be satisfied.
>>>
>>> So far, this patch should introduce no functional change, as no module yet
>>> report stopcopy size.
>>>
>>> This will pave way for VFIO to properly report its pending data sizes,
>>> which was actually buggy today. Will be done in follow up patches.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> include/migration/register.h | 12 +++++++++
>>> migration/migration.c | 52 ++++++++++++++++++++++++++++++------
>>> migration/savevm.c | 7 +++--
>>> migration/trace-events | 2 +-
>>> 4 files changed, 62 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/migration/register.h b/include/migration/register.h
>>> index 2320c3a981..3824958ba5 100644
>>> --- a/include/migration/register.h
>>> +++ b/include/migration/register.h
>>> @@ -17,12 +17,24 @@
>>> #include "hw/core/vmstate-if.h"
>>>
>>> typedef struct MigPendingData {
>>> + /*
>>> + * Modules can only update these fields in a query request via its
>>> + * save_query_pending() API.
>>> + */
>> Move comment to patch #5?
>>
>>> /* How many bytes are pending for precopy / stopcopy? */
>>> uint64_t precopy_bytes;
>>> /* How many bytes are pending that can be transferred in postcopy? */
>>> uint64_t postcopy_bytes;
>>> + /* How many bytes that can only be transferred when VM stopped? */
>>> + uint64_t stopcopy_bytes;
>> Keep consistent phrasing?
>>
>> /* Amount of pending bytes that can be transferred either in precopy or
>> stopcopy */
>> uint64_t precopy_bytes;
>> /* Amount of pending bytes that can be transferred in postcopy */
>> uint64_t postcopy_bytes;
>> /* Amount of pending bytes that can be transferred only in stopcopy */
>> uint64_t stopcopy_bytes;
>>
>>> +
>>> + /*
>>> + * Modules should never update these fields.
>>> + */
>> Move comment to patch #5?
> Sure, I'll address all above.
>
>>> /* Is this a fastpath query (which can be inaccurate)? */
>>> bool fastpath;
>>> + /* Total pending data */
>>> + uint64_t total_bytes;
>>> } MigPendingData ;
>>>
>>> /**
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 99c4d09000..42facb16d1 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -3198,6 +3198,44 @@ typedef enum {
>>> MIG_ITERATE_BREAK, /* Break the loop */
>>> } MigIterateState;
>>>
>>> +/* Are we ready to move to the next iteration phase? */
>>> +static bool migration_iteration_next_ready(MigrationState *s,
>>> + MigPendingData *pending)
>>> +{
>>> + /*
>>> + * If the estimated values already suggest us to switchover, mark this
>>> + * iteration finished, time to do a slow sync.
>>> + */
>>> + if (pending->total_bytes <= s->threshold_size) {
>>> + return true;
>>> + }
>>> +
>>> + /*
>>> + * Since we may have modules reporting stop-only data, we also want to
>>> + * re-query with slow mode if all precopy data is moved over. This
>>> + * will also mark the current iteration done.
>>> + *
>>> + * This could happen when e.g. a module (like, VFIO) reports stopcopy
>>> + * size too large so it will never yet satisfy the downtime with the
>>> + * current setup (above check). Here, slow version of re-query helps
>>> + * because we keep trying the best to move whatever we have.
>>> + */
>>> + if (pending->precopy_bytes == 0) {
>>> + return true;
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> +static void migration_iteration_go_next(MigPendingData *pending)
>>> +{
>>> + /*
>>> + * Do a slow sync will achieve this. TODO: move RAM iteration code
>>> + * into the core layer.
>>> + */
>>> + qemu_savevm_query_pending(pending, false);
>>> +}
>> I think the iteration terminology here could be confusing, because these two
>> functions are called from migration_iteration_run() and they don't refer to
>> the same iteration concept.
>> How about migration_pass_next_ready/migration_pass_go_next?
>> Or migration_dirty_sync_ready/migration_dirty_sync?
> I get the confusion, but IIUC the word "iteration" normally was used to
> describe one full walk of system resources, introducing "pass" may causing
> another way of confusion to me.
>
> E.g. see Dan's post in 2016 and since then iteration was defined as so:
>
> https://www.berrange.com/posts/2016/05/12/analysis-of-techniques-for-ensuring-migration-completion-with-kvm/
>
> Here, migration_iteration_run() was named to say "run the current
> iteration", but it doesn't necessarily mean it can only be run once for the
> current iteration.
Ah, thanks, reading the code with that in mind makes it much clearer.
However, I think the meaning of "iteration" is still not consistent, as
update_iteration_initial_status() and
MigrationState->iteration_{start_time, initial_bytes, initial_pages}
refer to an iteration as the 100ms period used for rate measurement. Or
am I missing something here?
Also, the MIG_ITERATE_{SKIP,BREAK,RESUME} return values of
migration_iteration_run() (and their docs) may be confusing, as they tie
"iteration" more closely to the while loop.
I just skimmed briefly, but can we make migration_iteration_run() return
a bool? (I don't think the SKIP value is really needed?)
BTW, while at it I noticed that the migration_iteration_run() doc is
stale [1].
[1]
https://elixir.bootlin.com/qemu/v11.0.0-rc2/source/migration/migration.c#L3202
>
> OTOH, I don't think it makes much sense we define "iteration" (or any new
> term at all) to be one run of migration_iteration_run(): currently, we
> invoke handlers' savevm iterator hook for each migration_iteration_run(),
> but that's pretty random behavior..
>
> For RAM, that's MAX_WAIT. For VFIO, it's something else...
>
> It was still a "grey area" and no way to define, even if we may want to
> define it at some point to make each iterator to follow some rule.
>
> Comparing to that, iteration to be defined to describe one round of walk
> for all VM resources makes a lot of sense and it's a criticla concept for
> precopy.
>
Yes, I agree. I guess some refactoring of the migration_iteration_run()
area could make it clearer (but not necessary as part of this series).
Thanks.
^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-06 12:20 ` Avihai Horon
@ 2026-04-07 15:30 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-07 15:30 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Mon, Apr 06, 2026 at 03:20:21PM +0300, Avihai Horon wrote:
>
> On 4/2/2026 5:09 PM, Peter Xu wrote:
> > On Wed, Mar 25, 2026 at 06:54:58PM +0200, Avihai Horon wrote:
> > > On 3/20/2026 1:12 AM, Peter Xu wrote:
> > > > Allow modules to report data that can only be migrated after VM is stopped.
> > > >
> > > > When this concept is introduced, we will need to account stopcopy size to
> > > > be part of pending_size as before.
> > > >
> > > > One thing to mention is, when there can be stopcopy size, it means the old
> > > > "pending_size" may not always be able to reach low enough to kickoff an
> > > > slow version of query sync. While it used to be almost guaranteed to
> > > > happen because if we keep iterating, normally pending_size can go to zero
> > > > for precopy-only because we assume everything reported can be migrated in
> > > > precopy phase.
> > > >
> > > > So we need to make sure QEMU will kickoff a synchronized version of query
> > > > pending when all precopy data is migrated too. This might be important to
> > > > VFIO to keep making progress even if the downtime cannot yet be satisfied.
> > > >
> > > > So far, this patch should introduce no functional change, as no module yet
> > > > report stopcopy size.
> > > >
> > > > This will pave way for VFIO to properly report its pending data sizes,
> > > > which was actually buggy today. Will be done in follow up patches.
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > > include/migration/register.h | 12 +++++++++
> > > > migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> > > > migration/savevm.c | 7 +++--
> > > > migration/trace-events | 2 +-
> > > > 4 files changed, 62 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > > index 2320c3a981..3824958ba5 100644
> > > > --- a/include/migration/register.h
> > > > +++ b/include/migration/register.h
> > > > @@ -17,12 +17,24 @@
> > > > #include "hw/core/vmstate-if.h"
> > > >
> > > > typedef struct MigPendingData {
> > > > + /*
> > > > + * Modules can only update these fields in a query request via its
> > > > + * save_query_pending() API.
> > > > + */
> > > Move comment to patch #5?
> > >
> > > > /* How many bytes are pending for precopy / stopcopy? */
> > > > uint64_t precopy_bytes;
> > > > /* How many bytes are pending that can be transferred in postcopy? */
> > > > uint64_t postcopy_bytes;
> > > > + /* How many bytes that can only be transferred when VM stopped? */
> > > > + uint64_t stopcopy_bytes;
> > > Keep consistent phrasing?
> > >
> > > /* Amount of pending bytes that can be transferred either in precopy or
> > > stopcopy */
> > > uint64_t precopy_bytes;
> > > /* Amount of pending bytes that can be transferred in postcopy */
> > > uint64_t postcopy_bytes;
> > > /* Amount of pending bytes that can be transferred only in stopcopy */
> > > uint64_t stopcopy_bytes;
> > >
> > > > +
> > > > + /*
> > > > + * Modules should never update these fields.
> > > > + */
> > > Move comment to patch #5?
> > Sure, I'll address all above.
> >
> > > > /* Is this a fastpath query (which can be inaccurate)? */
> > > > bool fastpath;
> > > > + /* Total pending data */
> > > > + uint64_t total_bytes;
> > > > } MigPendingData ;
> > > >
> > > > /**
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 99c4d09000..42facb16d1 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -3198,6 +3198,44 @@ typedef enum {
> > > > MIG_ITERATE_BREAK, /* Break the loop */
> > > > } MigIterateState;
> > > >
> > > > +/* Are we ready to move to the next iteration phase? */
> > > > +static bool migration_iteration_next_ready(MigrationState *s,
> > > > + MigPendingData *pending)
> > > > +{
> > > > + /*
> > > > + * If the estimated values already suggest us to switchover, mark this
> > > > + * iteration finished, time to do a slow sync.
> > > > + */
> > > > + if (pending->total_bytes <= s->threshold_size) {
> > > > + return true;
> > > > + }
> > > > +
> > > > + /*
> > > > + * Since we may have modules reporting stop-only data, we also want to
> > > > + * re-query with slow mode if all precopy data is moved over. This
> > > > + * will also mark the current iteration done.
> > > > + *
> > > > + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> > > > + * size too large so it will never yet satisfy the downtime with the
> > > > + * current setup (above check). Here, slow version of re-query helps
> > > > + * because we keep trying the best to move whatever we have.
> > > > + */
> > > > + if (pending->precopy_bytes == 0) {
> > > > + return true;
> > > > + }
> > > > +
> > > > + return false;
> > > > +}
> > > > +
> > > > +static void migration_iteration_go_next(MigPendingData *pending)
> > > > +{
> > > > + /*
> > > > + * Do a slow sync will achieve this. TODO: move RAM iteration code
> > > > + * into the core layer.
> > > > + */
> > > > + qemu_savevm_query_pending(pending, false);
> > > > +}
> > > I think the iteration terminology here could be confusing, because these two
> > > functions are called from migration_iteration_run() and they don't refer to
> > > the same iteration concept.
> > > How about migration_pass_next_ready/migration_pass_go_next?
> > > Or migration_dirty_sync_ready/migration_dirty_sync?
> > I get the confusion, but IIUC the word "iteration" normally was used to
> > describe one full walk of system resources, introducing "pass" may causing
> > another way of confusion to me.
> >
> > E.g. see Dan's post in 2016 and since then iteration was defined as so:
> >
> > https://www.berrange.com/posts/2016/05/12/analysis-of-techniques-for-ensuring-migration-completion-with-kvm/
> >
> > Here, migration_iteration_run() was named to say "run the current
> > iteration", but it doesn't necessarily mean it can only be run once for the
> > current iteration.
>
> Ah, thanks, reading the code with that in mind makes it much clearer.
>
> However, I think iteration meaning is still not consistent, as
> update_iteration_initial_status() and MigrationState->iteration_{start_time,
> initial_bytes, initial_pages} refer to iteration as the 100ms period for
> rate measurement. Or am I missing something here?
You're right, that was indeed confusing, and maybe renaming some of the
variables could help the code provide a consistent definition of
iterations.
It'll still be tricky though, as the word "iterate" should still be fine
when describing a walk over the resources, which may not happen exactly
once per iteration.
>
> Also the MIG_ITERATE_{SKIP,BREAK,RESUME} return value of
> migration_iteration_run() (and their docs) may be confusing, as it makes
> "iteration" more related to the while loop.
> I just skimmed briefly, but can we make migration_iteration_run() return a
> bool? (I don't think SKIP value is really needed?)
> BTW, while at it I noticed that the migration_iteration_run() doc is stale
> [1].
>
> [1] https://elixir.bootlin.com/qemu/v11.0.0-rc2/source/migration/migration.c#L3202
True.
>
> >
> > OTOH, I don't think it makes much sense we define "iteration" (or any new
> > term at all) to be one run of migration_iteration_run(): currently, we
> > invoke handlers' savevm iterator hook for each migration_iteration_run(),
> > but that's pretty random behavior..
> >
> > For RAM, that's MAX_WAIT. For VFIO, it's something else...
> >
> > It was still a "grey area" and no way to define, even if we may want to
> > define it at some point to make each iterator to follow some rule.
> >
> > Comparing to that, iteration to be defined to describe one round of walk
> > for all VM resources makes a lot of sense and it's a criticla concept for
> > precopy.
> >
> Yes, I agree. I guess some refactoring of the migration_iteration_run() area
> could make it clearer (but not necessary as part of this series). Thanks.
I'll add a todo to clean up this part after this series.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
2026-03-24 11:05 ` Prasad Pandit
2026-03-25 16:54 ` Avihai Horon
@ 2026-03-27 16:43 ` Juraj Marcin
2026-04-02 15:16 ` Peter Xu
2 siblings, 1 reply; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 16:43 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On 2026-03-19 19:12, Peter Xu wrote:
> Allow modules to report data that can only be migrated after VM is stopped.
>
> When this concept is introduced, we will need to account stopcopy size to
> be part of pending_size as before.
>
> One thing to mention is, when there can be stopcopy size, it means the old
> "pending_size" may not always be able to reach low enough to kickoff an
> slow version of query sync. While it used to be almost guaranteed to
> happen because if we keep iterating, normally pending_size can go to zero
> for precopy-only because we assume everything reported can be migrated in
> precopy phase.
>
> So we need to make sure QEMU will kickoff a synchronized version of query
> pending when all precopy data is migrated too. This might be important to
> VFIO to keep making progress even if the downtime cannot yet be satisfied.
>
> So far, this patch should introduce no functional change, as no module yet
> report stopcopy size.
>
> This will pave way for VFIO to properly report its pending data sizes,
> which was actually buggy today. Will be done in follow up patches.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/register.h | 12 +++++++++
> migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> migration/savevm.c | 7 +++--
> migration/trace-events | 2 +-
> 4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 2320c3a981..3824958ba5 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -17,12 +17,24 @@
> #include "hw/core/vmstate-if.h"
>
> typedef struct MigPendingData {
> + /*
> + * Modules can only update these fields in a query request via its
> + * save_query_pending() API.
> + */
> /* How many bytes are pending for precopy / stopcopy? */
> uint64_t precopy_bytes;
The comment suggests precopy_bytes should include both iterable precopy
and non-iterable precopy (stopcopy) bytes; however, all three fields are
then summed up for total_bytes.
> /* How many bytes are pending that can be transferred in postcopy? */
> uint64_t postcopy_bytes;
> + /* How many bytes that can only be transferred when VM stopped? */
> + uint64_t stopcopy_bytes;
I was also wondering if having precopy_iterable_bytes,
precopy_non_iterable_bytes, and postcopy_bytes would be clearer, but
given that stopcopy is already a term for this in VFIO it is probably
fine.
> +
> + /*
> + * Modules should never update these fields.
> + */
Maybe splitting the input and output parameters (i.e., the fields that
modules may touch vs. the output of the overall API) into separate
structures or plain parameters would be better than relying on the
comment.
> /* Is this a fastpath query (which can be inaccurate)? */
> bool fastpath;
> + /* Total pending data */
> + uint64_t total_bytes;
> } MigPendingData ;
>
> /**
> diff --git a/migration/migration.c b/migration/migration.c
> index 99c4d09000..42facb16d1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3198,6 +3198,44 @@ typedef enum {
> MIG_ITERATE_BREAK, /* Break the loop */
> } MigIterateState;
>
> +/* Are we ready to move to the next iteration phase? */
> +static bool migration_iteration_next_ready(MigrationState *s,
> + MigPendingData *pending)
> +{
> + /*
> + * If the estimated values already suggest us to switchover, mark this
> + * iteration finished, time to do a slow sync.
> + */
> + if (pending->total_bytes <= s->threshold_size) {
> + return true;
> + }
> +
> + /*
> + * Since we may have modules reporting stop-only data, we also want to
> + * re-query with slow mode if all precopy data is moved over. This
> + * will also mark the current iteration done.
> + *
> + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> + * size too large so it will never yet satisfy the downtime with the
> + * current setup (above check). Here, slow version of re-query helps
> + * because we keep trying the best to move whatever we have.
> + */
> + if (pending->precopy_bytes == 0) {
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void migration_iteration_go_next(MigPendingData *pending)
> +{
> + /*
> + * Do a slow sync will achieve this. TODO: move RAM iteration code
> + * into the core layer.
> + */
> + qemu_savevm_query_pending(pending, false);
> +}
I agree with Avihai regarding the iteration terminology. I slightly
prefer migration_dirty_sync_ready/migration_dirty_sync, but using pass
instead of iteration is also fine.
> +
> /*
> * Return true if continue to the next iteration directly, false
> * otherwise.
> @@ -3209,12 +3247,10 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> bool can_switchover = migration_can_switchover(s);
> MigPendingData pending = { };
> - uint64_t pending_size;
> bool complete_ready;
>
> /* Fast path - get the estimated amount of pending data */
> qemu_savevm_query_pending(&pending, true);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
>
> if (in_postcopy) {
> /*
> @@ -3222,7 +3258,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy completion doesn't rely on can_switchover, because when
> * POSTCOPY_ACTIVE it means switchover already happened.
> */
> - complete_ready = !pending_size;
> + complete_ready = !pending.total_bytes;
> if (s->state == MIGRATION_STATUS_POSTCOPY_DEVICE &&
> (s->postcopy_package_loaded || complete_ready)) {
> /*
> @@ -3242,9 +3278,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * postcopy started, so ESTIMATE should always match with EXACT
> * during postcopy phase.
> */
> - if (pending_size <= s->threshold_size) {
> - qemu_savevm_query_pending(&pending, false);
> - pending_size = pending.precopy_bytes + pending.postcopy_bytes;
> + if (migration_iteration_next_ready(s, &pending)) {
> + migration_iteration_go_next(&pending);
> }
>
> /* Should we switch to postcopy now? */
> @@ -3264,11 +3299,12 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> * (2) Pending size is no more than the threshold specified
> * (which was calculated from expected downtime)
> */
> - complete_ready = can_switchover && (pending_size <= s->threshold_size);
> + complete_ready = can_switchover &&
> + (pending.total_bytes <= s->threshold_size);
> }
>
> if (complete_ready) {
> - trace_migration_thread_low_pending(pending_size);
> + trace_migration_thread_low_pending(pending.total_bytes);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index b3285d480f..812c72b3e5 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1766,8 +1766,7 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> {
> SaveStateEntry *se;
>
> - pending->precopy_bytes = 0;
> - pending->postcopy_bytes = 0;
> + memset(pending, 0, sizeof(*pending));
> pending->fastpath = fastpath;
>
> QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> @@ -1780,7 +1779,11 @@ void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath)
> se->ops->save_query_pending(se->opaque, pending);
> }
>
> + pending->total_bytes = pending->precopy_bytes +
> + pending->stopcopy_bytes + pending->postcopy_bytes;
> +
> trace_qemu_savevm_query_pending(fastpath, pending->precopy_bytes,
> + pending->stopcopy_bytes,
> pending->postcopy_bytes);
> }
>
> diff --git a/migration/trace-events b/migration/trace-events
> index 5f836a8652..175f09f8ad 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,7 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> -qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", postcopy=%"PRIu64
> +qemu_savevm_query_pending(bool fast, uint64_t precopy, uint64_t stopcopy, uint64_t postcopy) "fast=%d, precopy=%"PRIu64", stopcopy=%"PRIu64", postcopy=%"PRIu64
> loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> --
> 2.50.1
>
^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-03-27 16:43 ` Juraj Marcin
@ 2026-04-02 15:16 ` Peter Xu
2026-04-07 15:19 ` Juraj Marcin
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-02 15:16 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, Mar 27, 2026 at 05:43:24PM +0100, Juraj Marcin wrote:
> On 2026-03-19 19:12, Peter Xu wrote:
> > Allow modules to report data that can only be migrated after VM is stopped.
> >
> > When this concept is introduced, we will need to account stopcopy size to
> > be part of pending_size as before.
> >
> > One thing to mention is, when there can be stopcopy size, it means the old
> > "pending_size" may not always be able to reach low enough to kick off a
> > slow version of the query sync. While it used to be almost guaranteed to
> > happen because if we keep iterating, normally pending_size can go to zero
> > for precopy-only because we assume everything reported can be migrated in
> > precopy phase.
> >
> > So we need to make sure QEMU will kick off a synchronized version of query
> > pending when all precopy data is migrated too. This might be important to
> > VFIO to keep making progress even if the downtime cannot yet be satisfied.
> >
> > So far, this patch should introduce no functional change, as no module yet
> > reports stopcopy size.
> >
> > This will pave the way for VFIO to properly report its pending data sizes,
> > which is actually buggy today. This will be done in follow-up patches.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/migration/register.h | 12 +++++++++
> > migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> > migration/savevm.c | 7 +++--
> > migration/trace-events | 2 +-
> > 4 files changed, 62 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/migration/register.h b/include/migration/register.h
> > index 2320c3a981..3824958ba5 100644
> > --- a/include/migration/register.h
> > +++ b/include/migration/register.h
> > @@ -17,12 +17,24 @@
> > #include "hw/core/vmstate-if.h"
> >
> > typedef struct MigPendingData {
> > + /*
> > + * Modules can only update these fields in a query request via its
> > + * save_query_pending() API.
> > + */
> > /* How many bytes are pending for precopy / stopcopy? */
> > uint64_t precopy_bytes;
>
> The comment suggests precopy_bytes should include iterable precopy and
> also non-iterable precopy (stopcopy) bytes, however, all 3 are then
> summed up for total_bytes.
Yes, the current way to categorize dirty info isn't as clear as it could be,
but it's the easiest so far given the previous definition of must_precopy.
With that, it's natural to define must-stopcopy, which is accounted separately
from "data that can be copied during stop, but is also iterable / precopy-able".
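To make the categorization concrete, here is a standalone model (hypothetical names, not the actual QEMU structs or functions): precopy bytes are iterable while the VM runs, stopcopy bytes can only move once it is stopped, and the switchover check discussed in this series looks at both the sum and whether precopy has drained:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the pending-data categories discussed above. */
typedef struct {
    uint64_t precopy_bytes;   /* iterable while the VM runs */
    uint64_t stopcopy_bytes;  /* transferable only with the VM stopped */
    uint64_t postcopy_bytes;  /* can be moved after switchover */
} Pending;

/* Mirrors the shape of migration_iteration_next_ready() in this series:
 * do a slow sync either when the estimated total fits the downtime
 * budget, or when all precopy-able data has been sent. */
static bool next_ready(const Pending *p, uint64_t threshold)
{
    uint64_t total = p->precopy_bytes + p->stopcopy_bytes + p->postcopy_bytes;

    return total <= threshold || p->precopy_bytes == 0;
}
```

Note how a large stopcopy component can keep the total above the threshold forever, which is exactly why the precopy-drained condition is needed.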
>
> > /* How many bytes are pending that can be transferred in postcopy? */
> > uint64_t postcopy_bytes;
> > + /* How many bytes that can only be transferred when VM stopped? */
> > + uint64_t stopcopy_bytes;
>
> I was also wondering if having precopy_iterable_bytes,
> precopy_non_iterable_bytes, and postcopy_bytes would be clearer, but
> given that stopcopy is already a term for this in VFIO it is probably
> fine.
Yes, your naming is better in some ways. However, it can also be slightly
confusing for precopy_non_iterable_bytes to represent stopcopy-only bytes.
IMHO we can leave the "hard problems" (naming..) for later cleanups and
fix the problem first.
>
> > +
> > + /*
> > + * Modules should never update these fields.
> > + */
>
> Maybe splitting input and output parameters, or things which modules
> should touch and output of the overall API into different
> structures/simple parameters could be better instead of the comment.
One thing I can immediately do is make both "exact" and "total_bytes"
const, then force-cast them only once when setting them. Would that be
slightly better? Any other suggestions?
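A rough sketch of that idea (hypothetical struct, not the actual MigPendingData): the core-owned field is declared const so a module assigning it fails to compile, and the core casts the const away in exactly one place. This relies on the enclosing object itself being writable, which holds for a stack- or heap-allocated query struct:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t precopy_bytes;     /* modules accumulate into these */
    uint64_t stopcopy_bytes;
    uint64_t postcopy_bytes;
    const uint64_t total_bytes; /* core-owned: modules get a compile error */
} PendingSketch;

/* The single place allowed to set total_bytes: cast the const away once. */
static void finish_query(PendingSketch *p)
{
    *(uint64_t *)&p->total_bytes =
        p->precopy_bytes + p->stopcopy_bytes + p->postcopy_bytes;
}
```

The const buys a compile-time guard against accidental writes, at the cost of one ugly cast owned by the core layer.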
It's always an option to fix the problem first then think about how to make
it prettier, rather than doing it in one shot.
So far, the immediate goal is to allow reporting VFIO remaining data and/or
expected downtime in query-migrate.
>
> > /* Is this a fastpath query (which can be inaccurate)? */
> > bool fastpath;
> > + /* Total pending data */
> > + uint64_t total_bytes;
> > } MigPendingData ;
> >
> > /**
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 99c4d09000..42facb16d1 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3198,6 +3198,44 @@ typedef enum {
> > MIG_ITERATE_BREAK, /* Break the loop */
> > } MigIterateState;
> >
> > +/* Are we ready to move to the next iteration phase? */
> > +static bool migration_iteration_next_ready(MigrationState *s,
> > + MigPendingData *pending)
> > +{
> > + /*
> > + * If the estimated values already suggest us to switchover, mark this
> > + * iteration finished, time to do a slow sync.
> > + */
> > + if (pending->total_bytes <= s->threshold_size) {
> > + return true;
> > + }
> > +
> > + /*
> > + * Since we may have modules reporting stop-only data, we also want to
> > + * re-query with slow mode if all precopy data is moved over. This
> > + * will also mark the current iteration done.
> > + *
> > + * This could happen when e.g. a module (like, VFIO) reports stopcopy
> > + * size too large so it will never yet satisfy the downtime with the
> > + * current setup (above check). Here, slow version of re-query helps
> > + * because we keep trying the best to move whatever we have.
> > + */
> > + if (pending->precopy_bytes == 0) {
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +static void migration_iteration_go_next(MigPendingData *pending)
> > +{
> > + /*
> > + * Do a slow sync will achieve this. TODO: move RAM iteration code
> > + * into the core layer.
> > + */
> > + qemu_savevm_query_pending(pending, false);
> > +}
>
> I agree with Avihai regarding the iteration terminology. I slightly
> prefer migration_dirty_sync_ready/migration_dirty_sync, but using pass
> instead of iteration is also fine.
Replied in the other email, let's see if we can reach a consensus.
Thanks,
--
Peter Xu
* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-02 15:16 ` Peter Xu
@ 2026-04-07 15:19 ` Juraj Marcin
2026-04-07 15:32 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Juraj Marcin @ 2026-04-07 15:19 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On 2026-04-02 11:16, Peter Xu wrote:
> On Fri, Mar 27, 2026 at 05:43:24PM +0100, Juraj Marcin wrote:
> > On 2026-03-19 19:12, Peter Xu wrote:
> > > Allow modules to report data that can only be migrated after VM is stopped.
> > >
> > > When this concept is introduced, we will need to account stopcopy size to
> > > be part of pending_size as before.
> > >
> > > One thing to mention is, when there can be stopcopy size, it means the old
> > > "pending_size" may not always be able to reach low enough to kick off a
> > > slow version of the query sync. While it used to be almost guaranteed to
> > > happen because if we keep iterating, normally pending_size can go to zero
> > > for precopy-only because we assume everything reported can be migrated in
> > > precopy phase.
> > >
> > > So we need to make sure QEMU will kick off a synchronized version of query
> > > pending when all precopy data is migrated too. This might be important to
> > > VFIO to keep making progress even if the downtime cannot yet be satisfied.
> > >
> > > So far, this patch should introduce no functional change, as no module yet
> > > reports stopcopy size.
> > >
> > > This will pave the way for VFIO to properly report its pending data sizes,
> > > which is actually buggy today. This will be done in follow-up patches.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > > include/migration/register.h | 12 +++++++++
> > > migration/migration.c | 52 ++++++++++++++++++++++++++++++------
> > > migration/savevm.c | 7 +++--
> > > migration/trace-events | 2 +-
> > > 4 files changed, 62 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/include/migration/register.h b/include/migration/register.h
> > > index 2320c3a981..3824958ba5 100644
> > > --- a/include/migration/register.h
> > > +++ b/include/migration/register.h
> > > @@ -17,12 +17,24 @@
> > > #include "hw/core/vmstate-if.h"
> > >
> > > typedef struct MigPendingData {
> > > + /*
> > > + * Modules can only update these fields in a query request via its
> > > + * save_query_pending() API.
> > > + */
> > > /* How many bytes are pending for precopy / stopcopy? */
> > > uint64_t precopy_bytes;
> >
> > The comment suggests precopy_bytes should include iterable precopy and
> > also non-iterable precopy (stopcopy) bytes, however, all 3 are then
> > summed up for total_bytes.
>
> Yes the current way to categorize dirty info isn't as clear, but it's the
> easiest so far based on the previous definition of must_precopy. With
> that, it's natural to define must-stopcopy, which is separately accounted
> from "data that can be copied during stop, but also iterable / precopy-able".
>
> >
> > > /* How many bytes are pending that can be transferred in postcopy? */
> > > uint64_t postcopy_bytes;
> > > + /* How many bytes that can only be transferred when VM stopped? */
> > > + uint64_t stopcopy_bytes;
> >
> > I was also wondering if having precopy_iterable_bytes,
> > precopy_non_iterable_bytes, and postcopy_bytes would be clearer, but
> > given that stopcopy is already a term for this in VFIO it is probably
> > fine.
>
> Yes, your naming is better in some way. However it can also be slightly
> confusing on precopy_non_iterable_bytes to represent stopcopy-only bytes.
>
> IMHO we can leave the "hard problems" (naming..) for later as cleanups and
> fix the problem first.
>
> >
> > > +
> > > + /*
> > > + * Modules should never update these fields.
> > > + */
> >
> > Maybe splitting input and output parameters, or things which modules
> > should touch and output of the overall API into different
> > structures/simple parameters could be better instead of the comment.
>
> One thing I can immediately do is making both "exact" and "total_bytes" to
> be consts, then force cast them only once when setting it. Would that be
> slightly better? Or any suggestions?
I think moving the fastpath boolean out and passing it as a separate
parameter would help greatly with the input/output separation. Then, even
if some modules violate the condition in the comment, there's no harm:
the field will just get overwritten afterward with the correct value.
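A minimal sketch of that split (hypothetical names, not the actual QEMU API): the struct carries only outputs, and fastpath travels as a plain argument, so the core can simply zero the whole struct and a handler scribbling on it cannot lose the flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t precopy_bytes;   /* outputs only: no input flag to preserve */
    uint64_t stopcopy_bytes;
    uint64_t postcopy_bytes;
} PendingOut;

typedef void (*QueryPendingFn)(void *opaque, PendingOut *out, bool fastpath);

/* Core-side query: the whole struct can simply be zeroed first. */
static void query_pending(PendingOut *out, bool fastpath,
                          QueryPendingFn fn, void *opaque)
{
    memset(out, 0, sizeof(*out));
    fn(opaque, out, fastpath);
}

/* Example handler: reports a cheap estimate on the fast path. */
static void demo_handler(void *opaque, PendingOut *out, bool fastpath)
{
    (void)opaque;
    out->precopy_bytes += fastpath ? 7 : 9;
}
```

With this shape the "modules should never update these fields" comment becomes unnecessary: there is simply nothing input-like left in the struct to corrupt.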
>
> It's always an option to fix the problem first then think about how to make
> it prettier, rather than doing it in one shot.
>
> So far, the immediate goal is to allow reporting VFIO remaining data and/or
> expected downtime in query-migrate.
>
...
^ permalink raw reply [flat|nested] 67+ messages in thread* Re: [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending()
2026-04-07 15:19 ` Juraj Marcin
@ 2026-04-07 15:32 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-07 15:32 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Tue, Apr 07, 2026 at 05:19:01PM +0200, Juraj Marcin wrote:
> I think, moving the fastpath boolean out and passing it as a separate
> parameter would help greatly with the input/output separation. Then, even if
> some modules violate the condition in the comment, there's no
> harm: the field will just get overwritten afterward with the correct value.
Sure, I'll use it.
--
Peter Xu
* [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (6 preceding siblings ...)
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-25 17:32 ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
` (3 subsequent siblings)
11 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
VFIO used to report different things in its fast/slow versions of the query
pending results. That was likely because it wanted to make sure precopy data
could reach 0 and hence trigger sync queries.
Now with the stopcopy size reporting facility it doesn't need this hack
anymore. Fix this.
Copying stable might be too much; just skip it and skip the Fixes tag.
Cc: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
hw/vfio/migration.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index c054c749b0..9dbe5ad9e9 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -591,6 +591,10 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
__func__, migration->precopy_init_size,
migration->precopy_dirty_size,
migration->stopcopy_size);
+ migration->stopcopy_size = 0;
+ } else {
+ migration->stopcopy_size -=
+ (migration->precopy_init_size + migration->precopy_dirty_size);
}
}
@@ -598,19 +602,18 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- uint64_t remain;
if (pending->fastpath) {
if (!vfio_device_state_is_precopy(vbasedev)) {
return;
}
- remain = migration->precopy_init_size + migration->precopy_dirty_size;
} else {
vfio_state_pending_sync(vbasedev);
- remain = migration->stopcopy_size;
}
- pending->precopy_bytes += remain;
+ pending->precopy_bytes +=
+ migration->precopy_init_size + migration->precopy_dirty_size;
+ pending->stopcopy_bytes += migration->stopcopy_size;
trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
migration->precopy_init_size,
--
2.50.1
* Re: [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
@ 2026-03-25 17:32 ` Avihai Horon
2026-04-02 15:28 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-03-25 17:32 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On 3/20/2026 1:12 AM, Peter Xu wrote:
> VFIO used to report different things in its fast/slow version of query
> pending results. That was likely because it wanted to make sure precopy data
> could reach 0 and hence trigger sync queries.
It was to guarantee precopy data can reach 0 and trigger dirty sync
queries, since not all VFIO device data may be precopy-able.
>
> Now with stopcopy size reporting facility it doesn't need this hack
> anymore. Fix this.
Looks good, now the fast/slow paths are consistent. Plus, we get the
benefit that if a VFIO device has a lot of stopcopy data, migration will
continuously try to reduce RAM/other iterative devices' pending data to
the minimum, instead of jumping from the fast to the slow path as it
currently does.
I still want to test this later with a few workloads, just to make sure
we don't miss anything here.
>
> Copying stable might be too much; just skip it and skip the Fixes tag.
>
> Cc: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/vfio/migration.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index c054c749b0..9dbe5ad9e9 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -591,6 +591,10 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
> __func__, migration->precopy_init_size,
> migration->precopy_dirty_size,
> migration->stopcopy_size);
> + migration->stopcopy_size = 0;
> + } else {
> + migration->stopcopy_size -=
> + (migration->precopy_init_size + migration->precopy_dirty_size);
Queries of stopcopy and precopy are not atomic, so we'd better use MIN() here:
migration->stopcopy_size -= MIN(migration->stopcopy_size,
migration->precopy_init_size + migration->precopy_dirty_size);
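To illustrate why the guard matters (hypothetical helper, not QEMU code): these counters are uint64_t, and unsigned subtraction wraps on underflow. If the non-atomically sampled precopy total momentarily exceeds stopcopy_size, a plain subtraction would yield a huge bogus remainder, whereas a saturating form clamps to 0:

```c
#include <assert.h>
#include <stdint.h>

/* Saturating subtraction: clamps to 0 instead of wrapping around. */
static uint64_t sat_sub_u64(uint64_t a, uint64_t b)
{
    return a > b ? a - b : 0;
}
```

The MIN() expression above is equivalent: `a - MIN(a, b)` can never underflow.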
> }
> }
>
> @@ -598,19 +602,18 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> - uint64_t remain;
>
> if (pending->fastpath) {
> if (!vfio_device_state_is_precopy(vbasedev)) {
> return;
> }
> - remain = migration->precopy_init_size + migration->precopy_dirty_size;
> } else {
> vfio_state_pending_sync(vbasedev);
> - remain = migration->stopcopy_size;
> }
>
> - pending->precopy_bytes += remain;
> + pending->precopy_bytes +=
> + migration->precopy_init_size + migration->precopy_dirty_size;
> + pending->stopcopy_bytes += migration->stopcopy_size;
Now that migration->stopcopy_size holds only the stopcopy size (and not
stopcopy+precopy), we should remove these lines from
vfio_update_estimated_pending_data():
/* The total size remaining requires separate accounting */
migration->stopcopy_size -= data_size;
Thanks.
>
> trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> migration->precopy_init_size,
> --
> 2.50.1
>
* Re: [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-03-25 17:32 ` Avihai Horon
@ 2026-04-02 15:28 ` Peter Xu
2026-04-02 15:55 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-02 15:28 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Wed, Mar 25, 2026 at 07:32:12PM +0200, Avihai Horon wrote:
>
> On 3/20/2026 1:12 AM, Peter Xu wrote:
> > VFIO used to report different things in its fast/slow version of query
> > pending results. That was likely because it wanted to make sure precopy data
> > could reach 0 and hence trigger sync queries.
>
> It was to guarantee precopy data can reach 0 and trigger dirty sync queries,
> since not all VFIO device data may be precopy-able.
Thanks for confirming.
>
> >
> > Now with stopcopy size reporting facility it doesn't need this hack
> > anymore. Fix this.
>
> Looks good, now the fast/slow path are consistent, plus we get the benefit
> that if VFIO device has much stopcopy data, migration will try to reduce
> RAM/other iterative devices to the minimum continuously instead of jumping
> from fast to slow path as it currently does.
Yep.
>
> I still want to test this later with a few workloads, just to make sure we
> don't miss anything here.
Thanks a lot. Let me know if there's any update.
>
> >
> > Copying stable might be too much; just skip it and skip the Fixes tag.
> >
> > Cc: Avihai Horon <avihaih@nvidia.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > hw/vfio/migration.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index c054c749b0..9dbe5ad9e9 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -591,6 +591,10 @@ static void vfio_state_pending_sync(VFIODevice *vbasedev)
> > __func__, migration->precopy_init_size,
> > migration->precopy_dirty_size,
> > migration->stopcopy_size);
> > + migration->stopcopy_size = 0;
> > + } else {
> > + migration->stopcopy_size -=
> > + (migration->precopy_init_size + migration->precopy_dirty_size);
>
> Query of stopcopy and precopy are not atomic, so we better use MIN() here:
>
> migration->stopcopy_size -= MIN(migration->stopcopy_size,
> migration->precopy_init_size + migration->precopy_dirty_size);
Yes. I'll revisit all these places for atomicity and possible overflow
issues in this series after I get a better understanding of the "data
shrink" possibilities.
>
> > }
> > }
> >
> > @@ -598,19 +602,18 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
> > {
> > VFIODevice *vbasedev = opaque;
> > VFIOMigration *migration = vbasedev->migration;
> > - uint64_t remain;
> >
> > if (pending->fastpath) {
> > if (!vfio_device_state_is_precopy(vbasedev)) {
> > return;
> > }
> > - remain = migration->precopy_init_size + migration->precopy_dirty_size;
> > } else {
> > vfio_state_pending_sync(vbasedev);
> > - remain = migration->stopcopy_size;
> > }
> >
> > - pending->precopy_bytes += remain;
> > + pending->precopy_bytes +=
> > + migration->precopy_init_size + migration->precopy_dirty_size;
> > + pending->stopcopy_bytes += migration->stopcopy_size;
>
> Now that migration->stopcopy_size holds only the stopcopy size (and not
> stopcopy+precopy), we should remove these lines from
> vfio_update_estimated_pending_data():
>
> /* The total size remaining requires separate accounting */
> migration->stopcopy_size -= data_size;
Good point, I missed this part when changing the definition, I'll fix.
Thanks,
>
> Thanks.
>
> >
> > trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> > migration->precopy_init_size,
> > --
> > 2.50.1
> >
>
--
Peter Xu
* Re: [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-04-02 15:28 ` Peter Xu
@ 2026-04-02 15:55 ` Peter Xu
2026-04-06 12:34 ` Avihai Horon
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-02 15:55 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Thu, Apr 02, 2026 at 11:28:09AM -0400, Peter Xu wrote:
> > Now that migration->stopcopy_size holds only the stopcopy size (and not
> > stopcopy+precopy), we should remove these lines from
> > vfio_update_estimated_pending_data():
> >
> > /* The total size remaining requires separate accounting */
> > migration->stopcopy_size -= data_size;
>
> Good point, I missed this part when changing the definition, I'll fix.
Instead of changing the definition, I found that it might be cleaner to
keep the definition of stopcopy cache variable for VFIO (considering the
other invocation during save setup), then just do one more step of math.
The new version should look like this:
===8<===
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 655ff7712a..5aff8de844 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -585,19 +585,27 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
- uint64_t remain;
+ uint64_t precopy_size, stopcopy_size;
if (pending->exact) {
vfio_state_pending_sync(vbasedev);
- remain = migration->stopcopy_size;
} else {
if (!vfio_device_state_is_precopy(vbasedev)) {
return;
}
- remain = migration->precopy_init_size + migration->precopy_dirty_size;
}
- pending->precopy_bytes += remain;
+ precopy_size =
+ migration->precopy_init_size + migration->precopy_dirty_size;
+
+ if (migration->stopcopy_size > precopy_size) {
+ stopcopy_size = migration->stopcopy_size - precopy_size;
+ } else {
+ stopcopy_size = 0;
+ }
+
+ pending->precopy_bytes += precopy_size;
+ pending->stopcopy_bytes += stopcopy_size;
trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
migration->precopy_init_size,
--
Peter Xu
* Re: [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-04-02 15:55 ` Peter Xu
@ 2026-04-06 12:34 ` Avihai Horon
2026-04-07 15:45 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Avihai Horon @ 2026-04-06 12:34 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On 4/2/2026 6:55 PM, Peter Xu wrote:
> On Thu, Apr 02, 2026 at 11:28:09AM -0400, Peter Xu wrote:
>>> Now that migration->stopcopy_size holds only the stopcopy size (and not
>>> stopcopy+precopy), we should remove these lines from
>>> vfio_update_estimated_pending_data():
>>>
>>> /* The total size remaining requires separate accounting */
>>> migration->stopcopy_size -= data_size;
>> Good point, I missed this part when changing the definition, I'll fix.
> Instead of changing the definition, I found that it might be cleaner to
> keep the definition of stopcopy cache variable for VFIO (considering the
> other invocation during save setup), then just do one more step of math.
>
> The new version should look like this:
Yes, that's better. Just small comment below.
>
> ===8<===
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 655ff7712a..5aff8de844 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -585,19 +585,27 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
> {
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
> - uint64_t remain;
> + uint64_t precopy_size, stopcopy_size;
>
> if (pending->exact) {
> vfio_state_pending_sync(vbasedev);
> - remain = migration->stopcopy_size;
> } else {
> if (!vfio_device_state_is_precopy(vbasedev)) {
> return;
Now in the estimate path we shouldn't bail early if not in precopy
state, so estimate/exact reports are consistent also for VFIO devices
that don't support precopy.
Thanks.
> }
> - remain = migration->precopy_init_size + migration->precopy_dirty_size;
> }
>
> - pending->precopy_bytes += remain;
> + precopy_size =
> + migration->precopy_init_size + migration->precopy_dirty_size;
> +
> + if (migration->stopcopy_size > precopy_size) {
> + stopcopy_size = migration->stopcopy_size - precopy_size;
> + } else {
> + stopcopy_size = 0;
> + }
> +
> + pending->precopy_bytes += precopy_size;
> + pending->stopcopy_bytes += stopcopy_size;
>
> trace_vfio_state_pending(vbasedev->name, migration->stopcopy_size,
> migration->precopy_init_size,
>
> --
> Peter Xu
>
* Re: [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data
2026-04-06 12:34 ` Avihai Horon
@ 2026-04-07 15:45 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-07 15:45 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P. Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Cédric Le Goater
On Mon, Apr 06, 2026 at 03:34:20PM +0300, Avihai Horon wrote:
>
> On 4/2/2026 6:55 PM, Peter Xu wrote:
> > On Thu, Apr 02, 2026 at 11:28:09AM -0400, Peter Xu wrote:
> > > > Now that migration->stopcopy_size holds only the stopcopy size (and not
> > > > stopcopy+precopy), we should remove these lines from
> > > > vfio_update_estimated_pending_data():
> > > >
> > > > /* The total size remaining requires separate accounting */
> > > > migration->stopcopy_size -= data_size;
> > > Good point, I missed this part when changing the definition, I'll fix.
> > Instead of changing the definition, I found that it might be cleaner to
> > keep the definition of stopcopy cache variable for VFIO (considering the
> > other invocation during save setup), then just do one more step of math.
> >
> > The new version should look like this:
>
> Yes, that's better. Just small comment below.
>
> >
> > ===8<===
> >
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 655ff7712a..5aff8de844 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -585,19 +585,27 @@ static void vfio_state_pending(void *opaque, MigPendingData *pending)
> > {
> > VFIODevice *vbasedev = opaque;
> > VFIOMigration *migration = vbasedev->migration;
> > - uint64_t remain;
> > + uint64_t precopy_size, stopcopy_size;
> >
> > if (pending->exact) {
> > vfio_state_pending_sync(vbasedev);
> > - remain = migration->stopcopy_size;
> > } else {
> > if (!vfio_device_state_is_precopy(vbasedev)) {
> > return;
>
> Now in the estimate path we shouldn't bail early if not in precopy state, so
> estimate/exact reports are consistent also for VFIO devices that don't
> support precopy.
Will do.
--
Peter Xu
* [PATCH RFC 09/12] migration: Make iteration counter out of RAM
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (7 preceding siblings ...)
2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
@ 2026-03-19 23:12 ` Peter Xu
2026-03-20 6:12 ` Yong Huang
` (2 more replies)
2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
` (2 subsequent siblings)
11 siblings, 3 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:12 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Yong Huang
It used to hide in the RAM dirty sync path. Now, with more modules being
able to do a slow sync on dirty information, keeping it there may not be
good anymore, because iterations are not RAM's own concept: all modules
should follow them.
More importantly, mgmt may try to query dirty info (to make policy
decisions like adjusting downtime) by listening to iteration count changes
via QMP events. So we must make sure the boost of iterations only happens
_after_ the dirty sync operations, in whatever form (RAM's dirty bitmap
sync, or VFIO's different ioctls to fetch the latest dirty info from the kernel).
Move this into the core migration path to manage, together with the event
generation, so that it can be well ordered with the sync operations of all
modules.
This brings a good side effect: it fixes an old issue where
cpu_throttle_dirty_sync_timer_tick() could randomly boost the iteration
count (because it invokes sync ops). Now it won't, which is actually the
right behavior.
That said, we have code (not only in QEMU, but likely in mgmt too) assuming
the 1st iteration always shows a dirty sync count of 1. Initialize it to 1
this time, because we'll now miss counting the dirty sync done in setup()
when boosting this counter here.
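The ordering requirement can be shown with a standalone model (hypothetical names, not QEMU code): the slow sync must finish before the counter is bumped, so that anyone who observes the new count and then re-queries pending data sees post-sync values. A seq_cst atomic RMW (like QEMU's qatomic_add()) provides the ordering:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static uint64_t pending_estimate = 123;            /* refreshed by the sync */
static atomic_uint_fast64_t dirty_sync_count = 1;  /* starts at 1, per above */

static void slow_sync(void)
{
    pending_estimate = 0; /* pretend every module re-reported exact data */
}

static void iteration_go_next(void)
{
    slow_sync(); /* must complete first ... */
    /* ... then bump the counter; the seq_cst RMW orders the sync's writes
     * before the incremented count becomes visible to other threads. */
    atomic_fetch_add(&dirty_sync_count, 1);
}
```

If the increment were done before (or concurrently with) the sync, a manager reacting to the MIGRATION_PASS event could still read stale pending data.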
Cc: Yong Huang <yong.huang@smartx.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration-stats.h | 3 ++-
migration/migration.c | 29 ++++++++++++++++++++++++++---
migration/ram.c | 6 ------
3 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 1153520f7a..326ddb0088 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -43,7 +43,8 @@ typedef struct {
*/
uint64_t dirty_pages_rate;
/*
- * Number of times we have synchronized guest bitmaps.
+ * Number of times we have synchronized guest bitmaps. This always
+ * starts from 1 for the 1st iteration.
*/
uint64_t dirty_sync_count;
/*
diff --git a/migration/migration.c b/migration/migration.c
index 42facb16d1..ad8a824585 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
s->threshold_size = 0;
s->switchover_acked = false;
s->rdma_migration = false;
+
/*
- * set mig_stats memory to zero for a new migration
+ * set mig_stats memory to zero for a new migration, except the
+ * iteration counter, which we want to make sure it returns 1 for the
+ * first iteration.
*/
memset(&mig_stats, 0, sizeof(mig_stats));
+ mig_stats.dirty_sync_count = 1;
+
migration_reset_vfio_bytes_transferred();
s->postcopy_package_loaded = false;
@@ -3230,10 +3235,28 @@ static bool migration_iteration_next_ready(MigrationState *s,
static void migration_iteration_go_next(MigPendingData *pending)
{
/*
- * Do a slow sync will achieve this. TODO: move RAM iteration code
- * into the core layer.
+ * Do a slow sync first before boosting the iteration count.
*/
qemu_savevm_query_pending(pending, false);
+
+ /*
+ * Boost dirty sync count to reflect we finished one iteration.
+ *
+ * NOTE: we need to make sure when this happens (together with the
+ * event sent below) all modules have slow-synced the pending data
+ * above. That means a write mem barrier, but qatomic_add() should be
+ * enough.
+ *
+ * It's because a mgmt could wait on the iteration event to query again
+ * on pending data for policy changes (e.g. downtime adjustments). The
+ * ordering will make sure the query will fetch the latest results from
+ * all the modules.
+ */
+ qatomic_add(&mig_stats.dirty_sync_count, 1);
+
+ if (migrate_events()) {
+ qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
+ }
}
/*
diff --git a/migration/ram.c b/migration/ram.c
index 89f761a471..29e9608715 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
RAMBlock *block;
int64_t end_time;
- qatomic_add(&mig_stats.dirty_sync_count, 1);
-
if (!rs->time_last_bitmap_sync) {
rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
}
@@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
rs->num_dirty_pages_period = 0;
rs->bytes_xfer_prev = migration_transferred_bytes();
}
- if (migrate_events()) {
- uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
- qapi_event_send_migration_pass(generation);
- }
}
void migration_bitmap_sync_precopy(bool last_stage)
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 09/12] migration: Make iteration counter out of RAM
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
@ 2026-03-20 6:12 ` Yong Huang
2026-03-20 9:49 ` Prasad Pandit
2026-03-27 16:49 ` Juraj Marcin
2 siblings, 0 replies; 67+ messages in thread
From: Yong Huang @ 2026-03-20 6:12 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
[-- Attachment #1: Type: text/plain, Size: 5298 bytes --]
Thanks,
Reviewed-by: Hyman Huang <yong.huang@smartx.com>
On Fri, Mar 20, 2026 at 7:13 AM Peter Xu <peterx@redhat.com> wrote:
> It used to hide in RAM dirty sync path. Now with more modules being able
> to slow sync on dirty information, keeping it there may not be good anymore
> because it's not RAM's own concept for iterations: all modules should
> follow.
>
> More importantly, mgmt may try to query dirty info (to make policy
> decisions like adjusting downtime) by listening to iteration count changes
> via QMP events. So we must make sure the boost of iterations only happens
> _after_ the dirty sync operations with whatever form (RAM's dirty bitmap
> sync, or VFIO's different ioctls to fetch latest dirty info from kernel).
>
> Move this to core migration path to manage, together with the event
> generation, so that it can be well ordered with the sync operations for all
> modules.
>
> This brings a good side effect that we should have an old issue regarding
> to cpu_throttle_dirty_sync_timer_tick() which can randomly boost iteration
> counts (because it invokes sync ops). Now it won't, which is actually the
> right behavior.
>
> Said that, we have code (not only QEMU, but likely mgmt too) assuming the
> 1st iteration will always shows dirty count to 1. Make it initialized with
> 1 this time, because we'll miss the dirty sync for setup() on boosting this
> counter now.
>
> Cc: Yong Huang <yong.huang@smartx.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration-stats.h | 3 ++-
> migration/migration.c | 29 ++++++++++++++++++++++++++---
> migration/ram.c | 6 ------
> 3 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 1153520f7a..326ddb0088 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -43,7 +43,8 @@ typedef struct {
> */
> uint64_t dirty_pages_rate;
> /*
> - * Number of times we have synchronized guest bitmaps.
> + * Number of times we have synchronized guest bitmaps. This always
> + * starts from 1 for the 1st iteration.
> */
> uint64_t dirty_sync_count;
> /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 42facb16d1..ad8a824585 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
> s->threshold_size = 0;
> s->switchover_acked = false;
> s->rdma_migration = false;
> +
> /*
> - * set mig_stats memory to zero for a new migration
> + * set mig_stats memory to zero for a new migration.. except the
> + * iteration counter, which we want to make sure it returns 1 for the
> + * first iteration.
> */
> memset(&mig_stats, 0, sizeof(mig_stats));
> + mig_stats.dirty_sync_count = 1;
> +
> migration_reset_vfio_bytes_transferred();
>
> s->postcopy_package_loaded = false;
> @@ -3230,10 +3235,28 @@ static bool
> migration_iteration_next_ready(MigrationState *s,
> static void migration_iteration_go_next(MigPendingData *pending)
> {
> /*
> - * Do a slow sync will achieve this. TODO: move RAM iteration code
> - * into the core layer.
> + * Do a slow sync first before boosting the iteration count.
> */
> qemu_savevm_query_pending(pending, false);
> +
> + /*
> + * Boost dirty sync count to reflect we finished one iteration.
> + *
> + * NOTE: we need to make sure when this happens (together with the
> + * event sent below) all modules have slow-synced the pending data
> + * above. That means a write mem barrier, but qatomic_add() should be
> + * enough.
> + *
> + * It's because a mgmt could wait on the iteration event to query
> again
> + * on pending data for policy changes (e.g. downtime adjustments).
> The
> + * ordering will make sure the query will fetch the latest results
> from
> + * all the modules.
> + */
> + qatomic_add(&mig_stats.dirty_sync_count, 1);
> +
> + if (migrate_events()) {
> + qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
> + }
> }
>
> /*
> diff --git a/migration/ram.c b/migration/ram.c
> index 89f761a471..29e9608715 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool
> last_stage)
> RAMBlock *block;
> int64_t end_time;
>
> - qatomic_add(&mig_stats.dirty_sync_count, 1);
> -
> if (!rs->time_last_bitmap_sync) {
> rs->time_last_bitmap_sync =
> qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> }
> @@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs,
> bool last_stage)
> rs->num_dirty_pages_period = 0;
> rs->bytes_xfer_prev = migration_transferred_bytes();
> }
> - if (migrate_events()) {
> - uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
> - qapi_event_send_migration_pass(generation);
> - }
> }
>
> void migration_bitmap_sync_precopy(bool last_stage)
> --
> 2.50.1
>
>
--
Best regards
[-- Attachment #2: Type: text/html, Size: 6771 bytes --]
^ permalink raw reply [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 09/12] migration: Make iteration counter out of RAM
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
2026-03-20 6:12 ` Yong Huang
@ 2026-03-20 9:49 ` Prasad Pandit
2026-04-02 15:35 ` Peter Xu
2026-03-27 16:49 ` Juraj Marcin
2 siblings, 1 reply; 67+ messages in thread
From: Prasad Pandit @ 2026-03-20 9:49 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Yong Huang
On Fri, 20 Mar 2026 at 04:44, Peter Xu <peterx@redhat.com> wrote:
> It used to hide in RAM dirty sync path. Now with more modules being able
> to slow sync on dirty information, keeping it there may not be good anymore
> because it's not RAM's own concept for iterations: all modules should
> follow.
>
> More importantly, mgmt may try to query dirty info (to make policy
> decisions like adjusting downtime) by listening to iteration count changes
> via QMP events. So we must make sure the boost of iterations only happens
> _after_ the dirty sync operations with whatever form (RAM's dirty bitmap
> sync, or VFIO's different ioctls to fetch latest dirty info from kernel).
>
> Move this to core migration path to manage, together with the event
> generation, so that it can be well ordered with the sync operations for all
> modules.
>
> This brings a good side effect that we should have an old issue regarding
> to cpu_throttle_dirty_sync_timer_tick() which can randomly boost iteration
> counts (because it invokes sync ops). Now it won't, which is actually the
> right behavior.
>
> Said that, we have code (not only QEMU, but likely mgmt too) assuming the
> 1st iteration will always shows dirty count to 1.
* Where do we make this assumption? I mostly see 'dirty_sync_count'
read/used as is, only cpu_throttle_dirty_sync_timer_tick() seems to
skip one *_bitmap_sync_precopy() call when sync_cnt <= 1. This'd work
for zero(0) as well.
> Cc: Yong Huang <yong.huang@smartx.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration-stats.h | 3 ++-
> migration/migration.c | 29 ++++++++++++++++++++++++++---
> migration/ram.c | 6 ------
> 3 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 1153520f7a..326ddb0088 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -43,7 +43,8 @@ typedef struct {
> */
> uint64_t dirty_pages_rate;
> /*
> - * Number of times we have synchronized guest bitmaps.
> + * Number of times we have synchronized guest bitmaps. This always
> + * starts from 1 for the 1st iteration.
> */
> uint64_t dirty_sync_count;
> /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 42facb16d1..ad8a824585 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1654,10 +1654,15 @@ int migrate_init(MigrationState *s, Error **errp)
> s->threshold_size = 0;
> s->switchover_acked = false;
> s->rdma_migration = false;
> +
> /*
> - * set mig_stats memory to zero for a new migration
> + * set mig_stats memory to zero for a new migration.. except the
> + * iteration counter, which we want to make sure it returns 1 for the
> + * first iteration.
> */
> memset(&mig_stats, 0, sizeof(mig_stats));
> + mig_stats.dirty_sync_count = 1;
> +
> migration_reset_vfio_bytes_transferred();
>
> s->postcopy_package_loaded = false;
> @@ -3230,10 +3235,28 @@ static bool migration_iteration_next_ready(MigrationState *s,
> static void migration_iteration_go_next(MigPendingData *pending)
> {
> /*
> - * Do a slow sync will achieve this. TODO: move RAM iteration code
> - * into the core layer.
> + * Do a slow sync first before boosting the iteration count.
> */
> qemu_savevm_query_pending(pending, false);
> +
> + /*
> + * Boost dirty sync count to reflect we finished one iteration.
> + *
> + * NOTE: we need to make sure when this happens (together with the
> + * event sent below) all modules have slow-synced the pending data
> + * above. That means a write mem barrier, but qatomic_add() should be
> + * enough.
> + *
> + * It's because a mgmt could wait on the iteration event to query again
> + * on pending data for policy changes (e.g. downtime adjustments). The
> + * ordering will make sure the query will fetch the latest results from
> + * all the modules.
> + */
> + qatomic_add(&mig_stats.dirty_sync_count, 1);
> +
> + if (migrate_events()) {
> + qapi_event_send_migration_pass(mig_stats.dirty_sync_count);
> + }
> }
>
> /*
> diff --git a/migration/ram.c b/migration/ram.c
> index 89f761a471..29e9608715 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1136,8 +1136,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
> RAMBlock *block;
> int64_t end_time;
>
> - qatomic_add(&mig_stats.dirty_sync_count, 1);
> -
> if (!rs->time_last_bitmap_sync) {
> rs->time_last_bitmap_sync = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> }
> @@ -1172,10 +1170,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
> rs->num_dirty_pages_period = 0;
> rs->bytes_xfer_prev = migration_transferred_bytes();
> }
> - if (migrate_events()) {
> - uint64_t generation = qatomic_read(&mig_stats.dirty_sync_count);
> - qapi_event_send_migration_pass(generation);
> - }
> }
>
> void migration_bitmap_sync_precopy(bool last_stage)
> --
* Change looks okay. Setting dirty_sync_count = 1 seems like special
casing; we need not do it if it's not required.
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Thank you.
---
- Prasad
^ permalink raw reply [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 09/12] migration: Make iteration counter out of RAM
2026-03-20 9:49 ` Prasad Pandit
@ 2026-04-02 15:35 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-02 15:35 UTC (permalink / raw)
To: Prasad Pandit
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Yong Huang
On Fri, Mar 20, 2026 at 03:19:17PM +0530, Prasad Pandit wrote:
> > Said that, we have code (not only QEMU, but likely mgmt too) assuming the
> > 1st iteration will always shows dirty count to 1.
>
> * Where do we make this assumption? I mostly see 'dirty_sync_count'
> read/used as is, only cpu_throttle_dirty_sync_timer_tick() seems to
> skip one *_bitmap_sync_precopy() call when sync_cnt <= 1. This'd works
> for zero(0) as well.
That is one reason; the stronger reason is that mgmt such as libvirt can
also read this value via QMP queries, and it can matter for mgmt decisions
on migration operations. Changing the definition of this value is an ABI
break.
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 09/12] migration: Make iteration counter out of RAM
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
2026-03-20 6:12 ` Yong Huang
2026-03-20 9:49 ` Prasad Pandit
@ 2026-03-27 16:49 ` Juraj Marcin
2026-04-02 15:42 ` Peter Xu
2 siblings, 1 reply; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 16:49 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Yong Huang
Hi Peter,
Did you maybe mean s/Make/Take/ in the subject?
On 2026-03-19 19:12, Peter Xu wrote:
> It used to hide in RAM dirty sync path. Now with more modules being able
> to slow sync on dirty information, keeping it there may not be good anymore
> because it's not RAM's own concept for iterations: all modules should
> follow.
>
> More importantly, mgmt may try to query dirty info (to make policy
> decisions like adjusting downtime) by listening to iteration count changes
> via QMP events. So we must make sure the boost of iterations only happens
> _after_ the dirty sync operations with whatever form (RAM's dirty bitmap
> sync, or VFIO's different ioctls to fetch latest dirty info from kernel).
>
> Move this to core migration path to manage, together with the event
> generation, so that it can be well ordered with the sync operations for all
> modules.
>
> This brings a good side effect that we should have an old issue regarding
> to cpu_throttle_dirty_sync_timer_tick() which can randomly boost iteration
> counts (because it invokes sync ops). Now it won't, which is actually the
> right behavior.
>
> Said that, we have code (not only QEMU, but likely mgmt too) assuming the
> 1st iteration will always shows dirty count to 1. Make it initialized with
> 1 this time, because we'll miss the dirty sync for setup() on boosting this
> counter now.
>
> Cc: Yong Huang <yong.huang@smartx.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration-stats.h | 3 ++-
> migration/migration.c | 29 ++++++++++++++++++++++++++---
> migration/ram.c | 6 ------
> 3 files changed, 28 insertions(+), 10 deletions(-)
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH RFC 09/12] migration: Make iteration counter out of RAM
2026-03-27 16:49 ` Juraj Marcin
@ 2026-04-02 15:42 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-02 15:42 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater,
Yong Huang
On Fri, Mar 27, 2026 at 05:49:46PM +0100, Juraj Marcin wrote:
> Hi Peter,
>
> Did you maybe mean s/Make/Take/ in the subject?
I was trying to use "take", but maybe it isn't accurate.
I asked my English teacher (Gemini Pro..) and I was told "move" is the
best, "take" worst, "make" OK but only in "dev-talks". I'll use "move".
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (8 preceding siblings ...)
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
@ 2026-03-19 23:13 ` Peter Xu
2026-03-23 10:26 ` Prasad Pandit
2026-03-27 17:07 ` Juraj Marcin
2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
11 siblings, 2 replies; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:13 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
Add a helper, migration_get_switchover_bw(), that returns an estimate of
the switchover bandwidth. Use it to slightly simplify the current code.
This will be used in later patches to remove expected_downtime.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 1 +
migration/migration.c | 43 ++++++++++++++++++++++---------------------
2 files changed, 23 insertions(+), 21 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index b6888daced..bf3ee6cc07 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -586,6 +586,7 @@ void migration_cancel(void);
void migration_populate_vfio_info(MigrationInfo *info);
void migration_reset_vfio_bytes_transferred(void);
void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
+int64_t migration_downtime_calc_expected(MigrationState *s);
/*
* Migration thread waiting for return path thread. Return non-zero if an
diff --git a/migration/migration.c b/migration/migration.c
index ad8a824585..56d605ede9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -984,6 +984,21 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
}
+/*
+ * Returns the estimated switchover bandwidth (unit: bytes / seconds)
+ */
+static double migration_get_switchover_bw(MigrationState *s)
+{
+ uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
+
+ if (switchover_bw) {
+ /* If user specified, prioritize this value and don't estimate */
+ return (double)switchover_bw;
+ }
+
+ return s->mbps / 8 * 1000 * 1000;
+}
+
bool migration_is_running(void)
{
MigrationState *s = current_migration;
@@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
{
uint64_t transferred, transferred_pages, time_spent;
uint64_t current_bytes; /* bytes transferred since the beginning */
- uint64_t switchover_bw;
- /* Expected bandwidth when switching over to destination QEMU */
- double expected_bw_per_ms;
- double bandwidth;
+ double switchover_bw;
if (current_time < s->iteration_start_time + BUFFER_DELAY) {
return;
}
- switchover_bw = migrate_avail_switchover_bandwidth();
current_bytes = migration_transferred_bytes();
transferred = current_bytes - s->iteration_initial_bytes;
time_spent = current_time - s->iteration_start_time;
- bandwidth = (double)transferred / time_spent;
-
- if (switchover_bw) {
- /*
- * If the user specified a switchover bandwidth, let's trust the
- * user so that can be more accurate than what we estimated.
- */
- expected_bw_per_ms = (double)switchover_bw / 1000;
- } else {
- /* If the user doesn't specify bandwidth, we use the estimated */
- expected_bw_per_ms = bandwidth;
- }
-
- s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
-
s->mbps = (((double) transferred * 8.0) /
((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
+ /* NOTE: only update this after bandwidth (s->mbps) updated */
+ switchover_bw = migration_get_switchover_bw(s) / 1000;
+ s->threshold_size = switchover_bw * migrate_downtime_limit();
+
transferred_pages = ram_get_total_transferred_pages() -
s->iteration_initial_pages;
s->pages_per_second = (double) transferred_pages /
@@ -3178,7 +3178,8 @@ static void migration_update_counters(MigrationState *s,
trace_migrate_transferred(transferred, time_spent,
/* Both in unit bytes/ms */
- bandwidth, switchover_bw / 1000,
+ (uint64_t)s->mbps,
+ (uint64_t)switchover_bw,
s->threshold_size);
}
--
2.50.1
^ permalink raw reply related [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate
2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
@ 2026-03-23 10:26 ` Prasad Pandit
2026-03-27 17:07 ` Juraj Marcin
1 sibling, 0 replies; 67+ messages in thread
From: Prasad Pandit @ 2026-03-23 10:26 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, 20 Mar 2026 at 04:45, Peter Xu <peterx@redhat.com> wrote:
> Add a helper to be able to return an estimate of switchover bw, called
* Add a helper to return ...
* bw -> bandwidth
> migration_get_switchover_bw(). Use it to sligitly simplify the current
> code.
* sligitly -> slightly ; (Better still: use it to simplify ...)
* This simplification is nice and required.
> This will be used in new codes later to remove expected_downtime.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.h | 1 +
> migration/migration.c | 43 ++++++++++++++++++++++---------------------
> 2 files changed, 23 insertions(+), 21 deletions(-)
>
> diff --git a/migration/migration.h b/migration/migration.h
> index b6888daced..bf3ee6cc07 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -586,6 +586,7 @@ void migration_cancel(void);
> void migration_populate_vfio_info(MigrationInfo *info);
> void migration_reset_vfio_bytes_transferred(void);
> void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> +int64_t migration_downtime_calc_expected(MigrationState *s);
* Let's return unsigned int64 -> uint64_t here? Downtime should be
positive OR does it return a negative error code?
* This function is not used in this patch, it is unrelated that way.
Let's add it in the patch where it is used?
> --- a/migration/migration.c
> +++ b/migration/migration.c
> +/*
> + * Returns the estimated switchover bandwidth (unit: bytes / seconds)
> + */
> +static double migration_get_switchover_bw(MigrationState *s)
> +{
> + uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
> +
> + if (switchover_bw) {
> + /* If user specified, prioritize this value and don't estimate */
> + return (double)switchover_bw;
> + }
> +
> + return s->mbps / 8 * 1000 * 1000;
> +}
> +
* 100 mbps will return => 100 / 8 * 1000 * 1000 => 12,500,000 bytes/sec
* When user specifies bandwidth with
virsh migrate --available-switchover-bandwidth <number>
bandwidth (in MiB/s) available for the final phase of migration
* Asking users to specify such a large number in bytes/sec is not user
friendly. 100mbps, 156mbps, or 400mbps would have been better; then the
above function could do
if (!switchover_bw) {
switchover_bw = s->mbps;
}
return switchover_bw / 8 * 1000 * 1000
> bool migration_is_running(void)
> {
> MigrationState *s = current_migration;
> @@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
> {
> uint64_t transferred, transferred_pages, time_spent;
> uint64_t current_bytes; /* bytes transferred since the beginning */
> - uint64_t switchover_bw;
> - /* Expected bandwidth when switching over to destination QEMU */
> - double expected_bw_per_ms;
> - double bandwidth;
> + double switchover_bw;
>
> if (current_time < s->iteration_start_time + BUFFER_DELAY) {
> return;
> }
>
> - switchover_bw = migrate_avail_switchover_bandwidth();
> current_bytes = migration_transferred_bytes();
> transferred = current_bytes - s->iteration_initial_bytes;
> time_spent = current_time - s->iteration_start_time;
> - bandwidth = (double)transferred / time_spent;
> -
> - if (switchover_bw) {
> - /*
> - * If the user specified a switchover bandwidth, let's trust the
> - * user so that can be more accurate than what we estimated.
> - */
> - expected_bw_per_ms = (double)switchover_bw / 1000;
> - } else {
> - /* If the user doesn't specify bandwidth, we use the estimated */
> - expected_bw_per_ms = bandwidth;
> - }
> -
> - s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> -
> s->mbps = (((double) transferred * 8.0) /
> ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> + /* NOTE: only update this after bandwidth (s->mbps) updated */
> + switchover_bw = migration_get_switchover_bw(s) / 1000;
* We get a 12.5 MB/sec return value; divide it by 1000 to get bytes/ms,
i.e. 12500000 / 1000 ==> 12500 bytes/ms
* It'll help to describe an example calculation in comments around
this math. Instead of dividing by 1000 here, why not make
migration_get_switchover_bw() return a bytes/ms value? That way the
actual calculation and the sample calculation in comments can go in
that function.
> + s->threshold_size = switchover_bw * migrate_downtime_limit();
> +
s->threshold_size = 12500 bytes/ms * 300 ms => 3,750,000 bytes
* IIUC, when pending_size <= s->threshold_size(=3.75MB), we can
pause the source VM and switch to the Postcopy phase, right?
> transferred_pages = ram_get_total_transferred_pages() -
> s->iteration_initial_pages;
> s->pages_per_second = (double) transferred_pages /
> @@ -3178,7 +3178,8 @@ static void migration_update_counters(MigrationState *s,
>
> trace_migrate_transferred(transferred, time_spent,
> /* Both in unit bytes/ms */
> - bandwidth, switchover_bw / 1000,
> + (uint64_t)s->mbps,
> + (uint64_t)switchover_bw,
> s->threshold_size);
* Here 's->mbps' is not in bytes/ms unit, is it?
Thank you.
---
- Prasad
^ permalink raw reply [flat|nested] 67+ messages in thread

* Re: [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate
2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
2026-03-23 10:26 ` Prasad Pandit
@ 2026-03-27 17:07 ` Juraj Marcin
2026-04-07 17:27 ` Peter Xu
1 sibling, 1 reply; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 17:07 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On 2026-03-19 19:13, Peter Xu wrote:
> Add a helper to be able to return an estimate of switchover bw, called
> migration_get_switchover_bw(). Use it to sligitly simplify the current
> code.
>
> This will be used in new codes later to remove expected_downtime.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.h | 1 +
> migration/migration.c | 43 ++++++++++++++++++++++---------------------
> 2 files changed, 23 insertions(+), 21 deletions(-)
>
> diff --git a/migration/migration.h b/migration/migration.h
> index b6888daced..bf3ee6cc07 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -586,6 +586,7 @@ void migration_cancel(void);
> void migration_populate_vfio_info(MigrationInfo *info);
> void migration_reset_vfio_bytes_transferred(void);
> void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> +int64_t migration_downtime_calc_expected(MigrationState *s);
Couldn't find the definition until I looked at the next patch :)
>
> /*
> * Migration thread waiting for return path thread. Return non-zero if an
> diff --git a/migration/migration.c b/migration/migration.c
> index ad8a824585..56d605ede9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -984,6 +984,21 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
> migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
> }
>
> +/*
> + * Returns the estimated switchover bandwidth (unit: bytes / seconds)
> + */
> +static double migration_get_switchover_bw(MigrationState *s)
> +{
> + uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
> +
> + if (switchover_bw) {
> + /* If user specified, prioritize this value and don't estimate */
> + return (double)switchover_bw;
> + }
> +
> + return s->mbps / 8 * 1000 * 1000;
> +}
> +
> bool migration_is_running(void)
> {
> MigrationState *s = current_migration;
> @@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
> {
> uint64_t transferred, transferred_pages, time_spent;
> uint64_t current_bytes; /* bytes transferred since the beginning */
> - uint64_t switchover_bw;
> - /* Expected bandwidth when switching over to destination QEMU */
> - double expected_bw_per_ms;
> - double bandwidth;
> + double switchover_bw;
>
> if (current_time < s->iteration_start_time + BUFFER_DELAY) {
> return;
> }
>
> - switchover_bw = migrate_avail_switchover_bandwidth();
> current_bytes = migration_transferred_bytes();
> transferred = current_bytes - s->iteration_initial_bytes;
> time_spent = current_time - s->iteration_start_time;
> - bandwidth = (double)transferred / time_spent;
> -
> - if (switchover_bw) {
> - /*
> - * If the user specified a switchover bandwidth, let's trust the
> - * user so that can be more accurate than what we estimated.
> - */
> - expected_bw_per_ms = (double)switchover_bw / 1000;
> - } else {
> - /* If the user doesn't specify bandwidth, we use the estimated */
> - expected_bw_per_ms = bandwidth;
> - }
> -
> - s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> -
> s->mbps = (((double) transferred * 8.0) /
> ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> + /* NOTE: only update this after bandwidth (s->mbps) updated */
> + switchover_bw = migration_get_switchover_bw(s) / 1000;
It isn't clear from the name that switchover_bw is in milliseconds; I
think the _per_ms suffix would solve this.
> + s->threshold_size = switchover_bw * migrate_downtime_limit();
> +
> transferred_pages = ram_get_total_transferred_pages() -
> s->iteration_initial_pages;
> s->pages_per_second = (double) transferred_pages /
There's also expected_bw_per_ms, which should probably be replaced by
switchover_bw[_per_ms]; otherwise it doesn't compile.
> @@ -3178,7 +3178,8 @@ static void migration_update_counters(MigrationState *s,
>
> trace_migrate_transferred(transferred, time_spent,
> /* Both in unit bytes/ms */
> - bandwidth, switchover_bw / 1000,
> + (uint64_t)s->mbps,
> + (uint64_t)switchover_bw,
> s->threshold_size);
> }
>
> --
> 2.50.1
>
Except for the comments above, the change looks good. With those resolved:
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate
2026-03-27 17:07 ` Juraj Marcin
@ 2026-04-07 17:27 ` Peter Xu
2026-04-08 14:33 ` Juraj Marcin
0 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-04-07 17:27 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, Mar 27, 2026 at 06:07:38PM +0100, Juraj Marcin wrote:
> On 2026-03-19 19:13, Peter Xu wrote:
> > Add a helper to be able to return an estimate of switchover bw, called
> > migration_get_switchover_bw(). Use it to slightly simplify the current
> > code.
> >
> > This will be used in new code later to remove expected_downtime.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/migration.h | 1 +
> > migration/migration.c | 43 ++++++++++++++++++++++---------------------
> > 2 files changed, 23 insertions(+), 21 deletions(-)
> >
> > diff --git a/migration/migration.h b/migration/migration.h
> > index b6888daced..bf3ee6cc07 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -586,6 +586,7 @@ void migration_cancel(void);
> > void migration_populate_vfio_info(MigrationInfo *info);
> > void migration_reset_vfio_bytes_transferred(void);
> > void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> > +int64_t migration_downtime_calc_expected(MigrationState *s);
>
> Couldn't find the definition until I looked at the next patch :)
Good catch.. I'll move it out.
>
> >
> > /*
> > * Migration thread waiting for return path thread. Return non-zero if an
> > diff --git a/migration/migration.c b/migration/migration.c
> > index ad8a824585..56d605ede9 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -984,6 +984,21 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
> > migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
> > }
> >
> > +/*
> > + * Returns the estimated switchover bandwidth (unit: bytes / seconds)
> > + */
> > +static double migration_get_switchover_bw(MigrationState *s)
> > +{
> > + uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
> > +
> > + if (switchover_bw) {
> > + /* If user specified, prioritize this value and don't estimate */
> > + return (double)switchover_bw;
> > + }
> > +
> > + return s->mbps / 8 * 1000 * 1000;
> > +}
> > +
> > bool migration_is_running(void)
> > {
> > MigrationState *s = current_migration;
> > @@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
> > {
> > uint64_t transferred, transferred_pages, time_spent;
> > uint64_t current_bytes; /* bytes transferred since the beginning */
> > - uint64_t switchover_bw;
> > - /* Expected bandwidth when switching over to destination QEMU */
> > - double expected_bw_per_ms;
> > - double bandwidth;
> > + double switchover_bw;
> >
> > if (current_time < s->iteration_start_time + BUFFER_DELAY) {
> > return;
> > }
> >
> > - switchover_bw = migrate_avail_switchover_bandwidth();
> > current_bytes = migration_transferred_bytes();
> > transferred = current_bytes - s->iteration_initial_bytes;
> > time_spent = current_time - s->iteration_start_time;
> > - bandwidth = (double)transferred / time_spent;
> > -
> > - if (switchover_bw) {
> > - /*
> > - * If the user specified a switchover bandwidth, let's trust the
> > - * user so that can be more accurate than what we estimated.
> > - */
> > - expected_bw_per_ms = (double)switchover_bw / 1000;
> > - } else {
> > - /* If the user doesn't specify bandwidth, we use the estimated */
> > - expected_bw_per_ms = bandwidth;
> > - }
> > -
> > - s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> > -
> > s->mbps = (((double) transferred * 8.0) /
> > ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
> >
> > + /* NOTE: only update this after bandwidth (s->mbps) updated */
> > + switchover_bw = migration_get_switchover_bw(s) / 1000;
>
> It isn't clear from the name that switchover_bw is in milliseconds; I
> think the _per_ms suffix would solve this.
Sure.
>
> > + s->threshold_size = switchover_bw * migrate_downtime_limit();
> > +
> > transferred_pages = ram_get_total_transferred_pages() -
> > s->iteration_initial_pages;
> > s->pages_per_second = (double) transferred_pages /
>
> There's also expected_bw_per_ms, which should probably be replaced by
> switchover_bw[_per_ms]; otherwise it doesn't compile.
Yep, will fix. I'll likely drop the two qatomic_read()s too along the way when
touching that line, because they're not needed.
>
> > @@ -3178,7 +3178,8 @@ static void migration_update_counters(MigrationState *s,
> >
> > trace_migrate_transferred(transferred, time_spent,
> > /* Both in unit bytes/ms */
> > - bandwidth, switchover_bw / 1000,
> > + (uint64_t)s->mbps,
> > + (uint64_t)switchover_bw,
> > s->threshold_size);
> > }
> >
> > --
> > 2.50.1
> >
>
> Except for the comments above, the change looks good. With those resolved:
>
> Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Thanks, attaching the final version here (I also touched up the commit
message per some English change requests from Prasad):
From 2258026f454621b7a5f04bb7bcb018b29482a695 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 19 Mar 2026 15:50:30 -0400
Subject: [PATCH] migration: Introduce a helper to return switchover bw
estimate
Add a helper migration_get_switchover_bw() to return an estimate of
switchover bandwidth. Use it to simplify the current code.
This will be used later to remove expected_downtime.
While at it, remove two qatomic_read() calls to shrink the lines; atomic
ops are not needed when it's always the same thread that does the updates.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 48 +++++++++++++++++++++----------------------
1 file changed, 24 insertions(+), 24 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index c6cc2f6c09..17003ac7ee 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -984,6 +984,21 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
}
+/*
+ * Returns the estimated switchover bandwidth (unit: bytes / seconds)
+ */
+static double migration_get_switchover_bw(MigrationState *s)
+{
+ uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
+
+ if (switchover_bw) {
+ /* If user specified, prioritize this value and don't estimate */
+ return (double)switchover_bw;
+ }
+
+ return s->mbps / 8 * 1000 * 1000;
+}
+
bool migration_is_running(void)
{
MigrationState *s = current_migration;
@@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
{
uint64_t transferred, transferred_pages, time_spent;
uint64_t current_bytes; /* bytes transferred since the beginning */
- uint64_t switchover_bw;
- /* Expected bandwidth when switching over to destination QEMU */
- double expected_bw_per_ms;
- double bandwidth;
+ double switchover_bw_per_ms;
if (current_time < s->iteration_start_time + BUFFER_DELAY) {
return;
}
- switchover_bw = migrate_avail_switchover_bandwidth();
current_bytes = migration_transferred_bytes();
transferred = current_bytes - s->iteration_initial_bytes;
time_spent = current_time - s->iteration_start_time;
- bandwidth = (double)transferred / time_spent;
-
- if (switchover_bw) {
- /*
- * If the user specified a switchover bandwidth, let's trust the
- * user so that can be more accurate than what we estimated.
- */
- expected_bw_per_ms = (double)switchover_bw / 1000;
- } else {
- /* If the user doesn't specify bandwidth, we use the estimated */
- expected_bw_per_ms = bandwidth;
- }
-
- s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
-
s->mbps = (((double) transferred * 8.0) /
((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
+ /* NOTE: only update this after bandwidth (s->mbps) updated */
+ switchover_bw_per_ms = migration_get_switchover_bw(s) / 1000;
+ s->threshold_size = switchover_bw_per_ms * migrate_downtime_limit();
+
transferred_pages = ram_get_total_transferred_pages() -
s->iteration_initial_pages;
s->pages_per_second = (double) transferred_pages /
@@ -3166,10 +3166,9 @@ static void migration_update_counters(MigrationState *s,
* if we haven't sent anything, we don't want to
* recalculate. 10000 is a small enough number for our purposes
*/
- if (qatomic_read(&mig_stats.dirty_pages_rate) &&
- transferred > 10000) {
+ if (mig_stats.dirty_pages_rate && transferred > 10000) {
s->expected_downtime =
- qatomic_read(&mig_stats.dirty_bytes_last_sync) / expected_bw_per_ms;
+ mig_stats.dirty_bytes_last_sync / switchover_bw_per_ms;
}
migration_rate_reset();
@@ -3178,7 +3177,8 @@ static void migration_update_counters(MigrationState *s,
trace_migrate_transferred(transferred, time_spent,
/* Both in unit bytes/ms */
- bandwidth, switchover_bw / 1000,
+ (uint64_t)s->mbps,
+ (uint64_t)switchover_bw_per_ms,
s->threshold_size);
}
--
2.53.0
--
Peter Xu
^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate
2026-04-07 17:27 ` Peter Xu
@ 2026-04-08 14:33 ` Juraj Marcin
0 siblings, 0 replies; 67+ messages in thread
From: Juraj Marcin @ 2026-04-08 14:33 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On 2026-04-07 13:27, Peter Xu wrote:
> On Fri, Mar 27, 2026 at 06:07:38PM +0100, Juraj Marcin wrote:
> > On 2026-03-19 19:13, Peter Xu wrote:
> > > Add a helper to be able to return an estimate of switchover bw, called
> > > migration_get_switchover_bw(). Use it to slightly simplify the current
> > > code.
> > >
> > > This will be used in new code later to remove expected_downtime.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > > migration/migration.h | 1 +
> > > migration/migration.c | 43 ++++++++++++++++++++++---------------------
> > > 2 files changed, 23 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index b6888daced..bf3ee6cc07 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -586,6 +586,7 @@ void migration_cancel(void);
> > > void migration_populate_vfio_info(MigrationInfo *info);
> > > void migration_reset_vfio_bytes_transferred(void);
> > > void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> > > +int64_t migration_downtime_calc_expected(MigrationState *s);
> >
> > Couldn't find the definition until I looked at the next patch :)
>
> Good catch.. I'll move it out.
>
...
> > > + /* NOTE: only update this after bandwidth (s->mbps) updated */
> > > + switchover_bw = migration_get_switchover_bw(s) / 1000;
> >
> > It isn't clear from the name that switchover_bw is in milliseconds; I
> > think the _per_ms suffix would solve this.
>
> Sure.
>
> >
> > > + s->threshold_size = switchover_bw * migrate_downtime_limit();
> > > +
> > > transferred_pages = ram_get_total_transferred_pages() -
> > > s->iteration_initial_pages;
> > > s->pages_per_second = (double) transferred_pages /
> >
> > There's also expected_bw_per_ms, which should probably be replaced by
> > switchover_bw[_per_ms]; otherwise it doesn't compile.
>
> Yep, will fix. I'll likely drop the two qatomic_read()s too along the way when
> touching that line, because they're not needed.
>
...
> >
> > Except for the comments above, the change looks good. With those resolved:
> >
> > Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
>
> Thanks, attaching the final version here (I also touched up the commit
> message per some English change requests from Prasad):
>
> From 2258026f454621b7a5f04bb7bcb018b29482a695 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 19 Mar 2026 15:50:30 -0400
> Subject: [PATCH] migration: Introduce a helper to return switchover bw
> estimate
>
> Add a helper migration_get_switchover_bw() to return an estimate of
> switchover bandwidth. Use it to simplify the current code.
>
> This will be used later to remove expected_downtime.
>
> While at it, remove two qatomic_read() calls to shrink the lines; atomic
> ops are not needed when it's always the same thread that does the updates.
>
> Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.c | 48 +++++++++++++++++++++----------------------
> 1 file changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index c6cc2f6c09..17003ac7ee 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -984,6 +984,21 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
> migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
> }
>
> +/*
> + * Returns the estimated switchover bandwidth (unit: bytes / seconds)
> + */
> +static double migration_get_switchover_bw(MigrationState *s)
> +{
> + uint64_t switchover_bw = migrate_avail_switchover_bandwidth();
> +
> + if (switchover_bw) {
> + /* If user specified, prioritize this value and don't estimate */
> + return (double)switchover_bw;
> + }
> +
> + return s->mbps / 8 * 1000 * 1000;
> +}
> +
> bool migration_is_running(void)
> {
> MigrationState *s = current_migration;
> @@ -3126,37 +3141,22 @@ static void migration_update_counters(MigrationState *s,
> {
> uint64_t transferred, transferred_pages, time_spent;
> uint64_t current_bytes; /* bytes transferred since the beginning */
> - uint64_t switchover_bw;
> - /* Expected bandwidth when switching over to destination QEMU */
> - double expected_bw_per_ms;
> - double bandwidth;
> + double switchover_bw_per_ms;
>
> if (current_time < s->iteration_start_time + BUFFER_DELAY) {
> return;
> }
>
> - switchover_bw = migrate_avail_switchover_bandwidth();
> current_bytes = migration_transferred_bytes();
> transferred = current_bytes - s->iteration_initial_bytes;
> time_spent = current_time - s->iteration_start_time;
> - bandwidth = (double)transferred / time_spent;
> -
> - if (switchover_bw) {
> - /*
> - * If the user specified a switchover bandwidth, let's trust the
> - * user so that can be more accurate than what we estimated.
> - */
> - expected_bw_per_ms = (double)switchover_bw / 1000;
> - } else {
> - /* If the user doesn't specify bandwidth, we use the estimated */
> - expected_bw_per_ms = bandwidth;
> - }
> -
> - s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> -
> s->mbps = (((double) transferred * 8.0) /
> ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> + /* NOTE: only update this after bandwidth (s->mbps) updated */
> + switchover_bw_per_ms = migration_get_switchover_bw(s) / 1000;
> + s->threshold_size = switchover_bw_per_ms * migrate_downtime_limit();
> +
> transferred_pages = ram_get_total_transferred_pages() -
> s->iteration_initial_pages;
> s->pages_per_second = (double) transferred_pages /
> @@ -3166,10 +3166,9 @@ static void migration_update_counters(MigrationState *s,
> * if we haven't sent anything, we don't want to
> * recalculate. 10000 is a small enough number for our purposes
> */
> - if (qatomic_read(&mig_stats.dirty_pages_rate) &&
> - transferred > 10000) {
> + if (mig_stats.dirty_pages_rate && transferred > 10000) {
> s->expected_downtime =
> - qatomic_read(&mig_stats.dirty_bytes_last_sync) / expected_bw_per_ms;
> + mig_stats.dirty_bytes_last_sync / switchover_bw_per_ms;
> }
>
> migration_rate_reset();
> @@ -3178,7 +3177,8 @@ static void migration_update_counters(MigrationState *s,
>
> trace_migrate_transferred(transferred, time_spent,
> /* Both in unit bytes/ms */
> - bandwidth, switchover_bw / 1000,
> + (uint64_t)s->mbps,
> + (uint64_t)switchover_bw_per_ms,
> s->threshold_size);
> }
>
> --
> 2.53.0
>
>
> --
> Peter Xu
>
LGTM!
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 11/12] migration: Calculate expected downtime on demand
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (9 preceding siblings ...)
2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
@ 2026-03-19 23:13 ` Peter Xu
2026-03-27 17:17 ` Juraj Marcin
2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
11 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:13 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
This value does not need to be calculated as frequently. Only calculate it
on demand when query-migrate happens. With that we can remove the
variable from MigrationState.
This paves the way for fixing this value to include all modules (not only RAM
but others too).
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 1 -
migration/migration.c | 26 ++++++++++++--------------
2 files changed, 12 insertions(+), 15 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index bf3ee6cc07..ba0f9e0f9c 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -359,7 +359,6 @@ struct MigrationState {
/* Timestamp when VM is down (ms) to migrate the last stuff */
int64_t downtime_start;
int64_t downtime;
- int64_t expected_downtime;
bool capabilities[MIGRATION_CAPABILITY__MAX];
int64_t setup_time;
diff --git a/migration/migration.c b/migration/migration.c
index 56d605ede9..23c78b3a2c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1041,6 +1041,17 @@ static bool migrate_show_downtime(MigrationState *s)
return (s->state == MIGRATION_STATUS_COMPLETED) || migration_in_postcopy();
}
+/* Return expected downtime (unit: milliseconds) */
+int64_t migration_downtime_calc_expected(MigrationState *s)
+{
+ if (mig_stats.dirty_sync_count <= 1) {
+ return migrate_downtime_limit();
+ }
+
+ return 1.0 * mig_stats.dirty_bytes_last_sync /
+ migration_get_switchover_bw(s) * 1000;
+}
+
static void populate_time_info(MigrationInfo *info, MigrationState *s)
{
info->has_status = true;
@@ -1061,7 +1072,7 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
info->downtime = s->downtime;
} else {
info->has_expected_downtime = true;
- info->expected_downtime = s->expected_downtime;
+ info->expected_downtime = migration_downtime_calc_expected(s);
}
}
@@ -1649,7 +1660,6 @@ int migrate_init(MigrationState *s, Error **errp)
s->mbps = 0.0;
s->pages_per_second = 0.0;
s->downtime = 0;
- s->expected_downtime = 0;
s->setup_time = 0;
s->start_postcopy = false;
s->migration_thread_running = false;
@@ -3162,16 +3172,6 @@ static void migration_update_counters(MigrationState *s,
s->pages_per_second = (double) transferred_pages /
(((double) time_spent / 1000.0));
- /*
- * if we haven't sent anything, we don't want to
- * recalculate. 10000 is a small enough number for our purposes
- */
- if (qatomic_read(&mig_stats.dirty_pages_rate) &&
- transferred > 10000) {
- s->expected_downtime =
- qatomic_read(&mig_stats.dirty_bytes_last_sync) / expected_bw_per_ms;
- }
-
migration_rate_reset();
update_iteration_initial_status(s);
@@ -3817,8 +3817,6 @@ void migration_start_outgoing(MigrationState *s)
bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
int ret;
- s->expected_downtime = migrate_downtime_limit();
-
if (resume) {
/* This is a resumed migration */
rate_limit = migrate_max_postcopy_bandwidth();
--
2.50.1
^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: [PATCH RFC 11/12] migration: Calculate expected downtime on demand
2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
@ 2026-03-27 17:17 ` Juraj Marcin
2026-04-07 17:33 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Juraj Marcin @ 2026-03-27 17:17 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On 2026-03-19 19:13, Peter Xu wrote:
> This value does not need to be calculated as frequently. Only calculate it
> on demand when query-migrate happens. With that we can remove the
> variable from MigrationState.
>
> This paves the way for fixing this value to include all modules (not only RAM
> but others too).
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.h | 1 -
> migration/migration.c | 26 ++++++++++++--------------
> 2 files changed, 12 insertions(+), 15 deletions(-)
>
> diff --git a/migration/migration.h b/migration/migration.h
> index bf3ee6cc07..ba0f9e0f9c 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -359,7 +359,6 @@ struct MigrationState {
> /* Timestamp when VM is down (ms) to migrate the last stuff */
> int64_t downtime_start;
> int64_t downtime;
> - int64_t expected_downtime;
> bool capabilities[MIGRATION_CAPABILITY__MAX];
> int64_t setup_time;
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 56d605ede9..23c78b3a2c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1041,6 +1041,17 @@ static bool migrate_show_downtime(MigrationState *s)
> return (s->state == MIGRATION_STATUS_COMPLETED) || migration_in_postcopy();
> }
>
> +/* Return expected downtime (unit: milliseconds) */
> +int64_t migration_downtime_calc_expected(MigrationState *s)
> +{
> + if (mig_stats.dirty_sync_count <= 1) {
> + return migrate_downtime_limit();
> + }
> +
> + return 1.0 * mig_stats.dirty_bytes_last_sync /
If the intention of '1.0 *' was to get a double out of dirty_bytes_last_sync, I
think an explicit cast would be better.
> + migration_get_switchover_bw(s) * 1000;
> +}
> +
> static void populate_time_info(MigrationInfo *info, MigrationState *s)
> {
> info->has_status = true;
> @@ -1061,7 +1072,7 @@ static void populate_time_info(MigrationInfo *info, MigrationState *s)
> info->downtime = s->downtime;
> } else {
> info->has_expected_downtime = true;
> - info->expected_downtime = s->expected_downtime;
> + info->expected_downtime = migration_downtime_calc_expected(s);
> }
> }
>
> @@ -1649,7 +1660,6 @@ int migrate_init(MigrationState *s, Error **errp)
> s->mbps = 0.0;
> s->pages_per_second = 0.0;
> s->downtime = 0;
> - s->expected_downtime = 0;
> s->setup_time = 0;
> s->start_postcopy = false;
> s->migration_thread_running = false;
> @@ -3162,16 +3172,6 @@ static void migration_update_counters(MigrationState *s,
> s->pages_per_second = (double) transferred_pages /
> (((double) time_spent / 1000.0));
>
> - /*
> - * if we haven't sent anything, we don't want to
> - * recalculate. 10000 is a small enough number for our purposes
> - */
> - if (qatomic_read(&mig_stats.dirty_pages_rate) &&
> - transferred > 10000) {
> - s->expected_downtime =
> - qatomic_read(&mig_stats.dirty_bytes_last_sync) / expected_bw_per_ms;
> - }
> -
> migration_rate_reset();
>
> update_iteration_initial_status(s);
> @@ -3817,8 +3817,6 @@ void migration_start_outgoing(MigrationState *s)
> bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
> int ret;
>
> - s->expected_downtime = migrate_downtime_limit();
> -
> if (resume) {
> /* This is a resumed migration */
> rate_limit = migrate_max_postcopy_bandwidth();
> --
> 2.50.1
>
^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [PATCH RFC 11/12] migration: Calculate expected downtime on demand
2026-03-27 17:17 ` Juraj Marcin
@ 2026-04-07 17:33 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-07 17:33 UTC (permalink / raw)
To: Juraj Marcin
Cc: qemu-devel, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, Mar 27, 2026 at 06:17:03PM +0100, Juraj Marcin wrote:
> > +/* Return expected downtime (unit: milliseconds) */
> > +int64_t migration_downtime_calc_expected(MigrationState *s)
> > +{
> > + if (mig_stats.dirty_sync_count <= 1) {
> > + return migrate_downtime_limit();
> > + }
> > +
> > + return 1.0 * mig_stats.dirty_bytes_last_sync /
>
> If the intention of '1.0 *' was to get double out of dirty_bytes_last_sync, I think
> an explicit cast would be better.
Let me drop the "1.0 *"; it's not needed when the divisor is double.
Thanks,
>
> > + migration_get_switchover_bw(s) * 1000;
> > +}
--
Peter Xu
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
` (10 preceding siblings ...)
2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
@ 2026-03-19 23:13 ` Peter Xu
2026-03-23 12:05 ` Prasad Pandit
11 siblings, 1 reply; 67+ messages in thread
From: Peter Xu @ 2026-03-19 23:13 UTC (permalink / raw)
To: qemu-devel
Cc: Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, peterx, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
QEMU will provide an expected downtime for the whole system during
migration, by remembering the total dirty RAM that we synced the last time,
divides the estimated switchover bandwidth.
That is flawed when VFIO is taken into account: consider a VFIO
GPU device that contains GBs of data to migrate during the stop phase. Those
bytes will not be accounted for in this math.
Fix it by updating dirty_bytes_last_sync properly only when we go to the
next iteration, rather than hiding this update in the RAM code. Meanwhile,
fetch the total (rather than RAM-only) portion of dirty bytes, so as to
include GPU device states too.
Update the comment of the field to reflect its new meaning.
Now after this change, the expected-downtime to be read from query-migrate
should be very accurate even with VFIO devices involved.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration-stats.h | 10 +++++-----
migration/migration.c | 11 ++++++++---
migration/ram.c | 1 -
3 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 326ddb0088..14b2773beb 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -31,11 +31,11 @@
*/
typedef struct {
/*
- * Number of bytes that were dirty last time that we synced with
- * the guest memory. We use that to calculate the downtime. As
- * the remaining dirty amounts to what we know that is still dirty
- * since last iteration, not counting what the guest has dirtied
- * since we synchronized bitmaps.
+ * Number of bytes that are still dirty after the last whole-system
+ * sync on dirty information. We use that to calculate the expected
+ * downtime. As the remaining dirty amounts to what we know that is
+ * still dirty since last iteration, not counting what the guest has
+ * dirtied since then on either RAM or device states.
*/
uint64_t dirty_bytes_last_sync;
/*
diff --git a/migration/migration.c b/migration/migration.c
index 23c78b3a2c..1c00572d14 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3240,18 +3240,23 @@ static void migration_iteration_go_next(MigPendingData *pending)
*/
qemu_savevm_query_pending(pending, false);
+ /*
+ * Update the dirty information for the whole system for this
+ * iteration. This value is used to calculate expected downtime.
+ */
+ qatomic_set(&mig_stats.dirty_bytes_last_sync, pending->total_bytes);
+
/*
* Boost dirty sync count to reflect we finished one iteration.
*
* NOTE: we need to make sure when this happens (together with the
* event sent below) all modules have slow-synced the pending data
- * above. That means a write mem barrier, but qatomic_add() should be
- * enough.
+ * above and updated corresponding fields (e.g. dirty_bytes_last_sync).
*
* It's because a mgmt could wait on the iteration event to query again
* on pending data for policy changes (e.g. downtime adjustments). The
* ordering will make sure the query will fetch the latest results from
- * all the modules.
+ * all the modules on everything.
*/
qatomic_add(&mig_stats.dirty_sync_count, 1);
diff --git a/migration/ram.c b/migration/ram.c
index 29e9608715..1bdf121d16 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1148,7 +1148,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
RAMBLOCK_FOREACH_NOT_IGNORED(block) {
ramblock_sync_dirty_bitmap(rs, block);
}
- qatomic_set(&mig_stats.dirty_bytes_last_sync, ram_bytes_remaining());
}
}
--
2.50.1
^ permalink raw reply related	[flat|nested] 67+ messages in thread
* Re: [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info
2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
@ 2026-03-23 12:05 ` Prasad Pandit
2026-04-07 17:40 ` Peter Xu
0 siblings, 1 reply; 67+ messages in thread
From: Prasad Pandit @ 2026-03-23 12:05 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> QEMU will provide an expected downtime for the whole system during
> migration, by remembering the total dirty RAM that we synced the last time,
> divides the estimated switchover bandwidth.
* ie. Total dirty RAM synced / estimated bandwidth? If so, divides -> divided by
> That was flawed when VFIO is taken into account: consider a VFIO
> GPU device that contains GBs of data to migrate during the stop phase. Those
> will not be accounted for in this math.
>
> Fix it by updating dirty_bytes_last_sync properly only when we go to the
> next iteration, rather than hide this update in the RAM code. Meanwhile,
> fetch the total (rather than RAM-only) portion of dirty bytes, so as to
> include GPU device states too.
>
> Update the comment of the field to reflect its new meaning.
>
> Now after this change, the expected-downtime to be read from query-migrate
> should be very accurate even with VFIO devices involved.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration-stats.h | 10 +++++-----
> migration/migration.c | 11 ++++++++---
> migration/ram.c | 1 -
> 3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 326ddb0088..14b2773beb 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -31,11 +31,11 @@
> */
> typedef struct {
> /*
> - * Number of bytes that were dirty last time that we synced with
> - * the guest memory. We use that to calculate the downtime. As
> - * the remaining dirty amounts to what we know that is still dirty
> - * since last iteration, not counting what the guest has dirtied
> - * since we synchronized bitmaps.
> + * Number of bytes that are still dirty after the last whole-system
> + * sync on dirty information. We use that to calculate the expected
> + * downtime. As the remaining dirty amounts to what we know that is
> + * still dirty since last iteration, not counting what the guest has
> + * dirtied since then on either RAM or device states.
> */
* Multiple uses of "dirty" make it rather unclear (and funny).
* May be:
Number of bytes pending since the last full-system accounting
of the dirty bytes' information. We use this number to calculate
the expected downtime. The pending bytes do not account for
the new bytes the guest may have written since the last update
of the dirty bytes' information.
> uint64_t dirty_bytes_last_sync;
> /*
> diff --git a/migration/migration.c b/migration/migration.c
> index 23c78b3a2c..1c00572d14 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3240,18 +3240,23 @@ static void migration_iteration_go_next(MigPendingData *pending)
> */
> qemu_savevm_query_pending(pending, false);
>
> + /*
> + * Update the dirty information for the whole system for this
* dirty bytes' information
> + * iteration. This value is used to calculate expected downtime.
> + */
> + qatomic_set(&mig_stats.dirty_bytes_last_sync, pending->total_bytes);
> +
> /*
> * Boost dirty sync count to reflect we finished one iteration.
> *
> * NOTE: we need to make sure when this happens (together with the
> * event sent below) all modules have slow-synced the pending data
> - * above. That means a write mem barrier, but qatomic_add() should be
> - * enough.
> + * above and updated corresponding fields (e.g. dirty_bytes_last_sync).
> *
> * It's because a mgmt could wait on the iteration event to query again
> * on pending data for policy changes (e.g. downtime adjustments). The
> * ordering will make sure the query will fetch the latest results from
> - * all the modules.
> + * all the modules on everything.
> */
> qatomic_add(&mig_stats.dirty_sync_count, 1);
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 29e9608715..1bdf121d16 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1148,7 +1148,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
> RAMBLOCK_FOREACH_NOT_IGNORED(block) {
> ramblock_sync_dirty_bitmap(rs, block);
> }
> - qatomic_set(&mig_stats.dirty_bytes_last_sync, ram_bytes_remaining());
* Shouldn't the ram_bytes_remaining() go into pending->total_bytes
here? ie pending->total_bytes += ram_bytes_remaining()
Thank you.
---
- Prasad
* Re: [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info
2026-03-23 12:05 ` Prasad Pandit
@ 2026-04-07 17:40 ` Peter Xu
0 siblings, 0 replies; 67+ messages in thread
From: Peter Xu @ 2026-04-07 17:40 UTC (permalink / raw)
To: Prasad Pandit
Cc: qemu-devel, Juraj Marcin, Kirti Wankhede, Maciej S . Szmigiero,
Daniel P . Berrangé, Joao Martins, Alex Williamson,
Yishai Hadas, Fabiano Rosas, Pranav Tyagi, Zhiyi Guo,
Markus Armbruster, Avihai Horon, Cédric Le Goater
On Mon, Mar 23, 2026 at 05:35:32PM +0530, Prasad Pandit wrote:
> On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> > QEMU will provide an expected downtime for the whole system during
> > migration, by remembering the total dirty RAM that we synced the last time,
> > divides the estimated switchover bandwidth.
>
> * ie. Total dirty RAM synced / estimated bandwidth? If so, divides -> divided by
>
> > That was flawed when VFIO is taking into account: consider there is a VFIO
> > GPU device that contains GBs of data to migrate during stop phase. Those
> > will not be accounted in this math.
> >
> > Fix it by updating dirty_bytes_last_sync properly only when we go to the
> > next iteration, rather than hide this update in the RAM code. Meanwhile,
> > fetch the total (rather than RAM-only) portion of dirty bytes, so as to
> > include GPU device states too.
> >
> > Update the comment of the field to reflect its new meaning.
> >
> > Now after this change, the expected-downtime to be read from query-migrate
> > should be very accurate even with VFIO devices involved.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/migration-stats.h | 10 +++++-----
> > migration/migration.c | 11 ++++++++---
> > migration/ram.c | 1 -
> > 3 files changed, 13 insertions(+), 9 deletions(-)
> >
> > diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> > index 326ddb0088..14b2773beb 100644
> > --- a/migration/migration-stats.h
> > +++ b/migration/migration-stats.h
> > @@ -31,11 +31,11 @@
> > */
> > typedef struct {
> > /*
> > - * Number of bytes that were dirty last time that we synced with
> > - * the guest memory. We use that to calculate the downtime. As
> > - * the remaining dirty amounts to what we know that is still dirty
> > - * since last iteration, not counting what the guest has dirtied
> > - * since we synchronized bitmaps.
> > + * Number of bytes that are still dirty after the last whole-system
> > + * sync on dirty information. We use that to calculate the expected
> > + * downtime. As the remaining dirty amounts to what we know that is
> > + * still dirty since last iteration, not counting what the guest has
> > + * dirtied since then on either RAM or device states.
> > */
>
> * Multiple uses of dirty is making it rather unclear. (any funny)
>
> * May be:
> Number of bytes pending since the last full-system accounting
> of the dirty bytes' information. We use this number to calculate
> the expected downtime. The pending bytes do not account for
> the new bytes the guest may have written since the last update
> of the dirty bytes' information.
I didn't change the rest except the scope of sync. I'll simplify this
section to below:
/*
* Number of bytes that were reported dirty after the latest
* system-wide synchronization on dirty information. It is used to
* do best-effort estimation of the expected downtime.
*/
>
> > uint64_t dirty_bytes_last_sync;
> > /*
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 23c78b3a2c..1c00572d14 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3240,18 +3240,23 @@ static void migration_iteration_go_next(MigPendingData *pending)
> > */
> > qemu_savevm_query_pending(pending, false);
> >
> > + /*
> > + * Update the dirty information for the whole system for this
>
> * dirty bytes' information
I prefer "dirty information", because it may not always be about bytes,
e.g. bitmaps are counted in number of pages.
>
> > + * iteration. This value is used to calculate expected downtime.
> > + */
> > + qatomic_set(&mig_stats.dirty_bytes_last_sync, pending->total_bytes);
> > +
> > /*
> > * Boost dirty sync count to reflect we finished one iteration.
> > *
> > * NOTE: we need to make sure when this happens (together with the
> > * event sent below) all modules have slow-synced the pending data
> > - * above. That means a write mem barrier, but qatomic_add() should be
> > - * enough.
> > + * above and updated corresponding fields (e.g. dirty_bytes_last_sync).
> > *
> > * It's because a mgmt could wait on the iteration event to query again
> > * on pending data for policy changes (e.g. downtime adjustments). The
> > * ordering will make sure the query will fetch the latest results from
> > - * all the modules.
> > + * all the modules on everything.
> > */
> > qatomic_add(&mig_stats.dirty_sync_count, 1);
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 29e9608715..1bdf121d16 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1148,7 +1148,6 @@ static void migration_bitmap_sync(RAMState *rs, bool last_stage)
> > RAMBLOCK_FOREACH_NOT_IGNORED(block) {
> > ramblock_sync_dirty_bitmap(rs, block);
> > }
> > - qatomic_set(&mig_stats.dirty_bytes_last_sync, ram_bytes_remaining());
>
> * Shouldn't the ram_bytes_remaining() go into pending->total_bytes
> here? ie pending->total_bytes += ram_bytes_remaining()
It's included via qemu_savevm_query_pending().
Thanks,
--
Peter Xu