From: Peter Xu <peterx@redhat.com>
To: Avihai Horon <avihaih@nvidia.com>
Cc: qemu-devel@nongnu.org, "Juraj Marcin" <jmarcin@redhat.com>,
"Kirti Wankhede" <kwankhede@nvidia.com>,
"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
"Daniel P . Berrangé" <berrange@redhat.com>,
"Joao Martins" <joao.m.martins@oracle.com>,
"Alex Williamson" <alex@shazbot.org>,
"Yishai Hadas" <yishaih@nvidia.com>,
"Fabiano Rosas" <farosas@suse.de>,
"Pranav Tyagi" <prtyagi@redhat.com>,
"Zhiyi Guo" <zhguo@redhat.com>,
"Markus Armbruster" <armbru@redhat.com>,
"Cédric Le Goater" <clg@redhat.com>,
"Halil Pasic" <pasic@linux.ibm.com>,
"Christian Borntraeger" <borntraeger@linux.ibm.com>,
"Jason Herne" <jjherne@linux.ibm.com>,
"Eric Farman" <farman@linux.ibm.com>,
"Matthew Rosato" <mjrosato@linux.ibm.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
"Ilya Leoshkevich" <iii@linux.ibm.com>,
"David Hildenbrand" <david@kernel.org>,
"Cornelia Huck" <cohuck@redhat.com>,
"Eric Blake" <eblake@redhat.com>,
"Vladimir Sementsov-Ogievskiy" <vsementsov@yandex-team.ru>,
"John Snow" <jsnow@redhat.com>
Subject: Re: [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs
Date: Wed, 1 Apr 2026 17:22:18 -0400 [thread overview]
Message-ID: <ac2MitFie3jXyiRk@x1.local> (raw)
In-Reply-To: <b957ae39-12e0-40d2-b0b3-8f09bff10bf0@nvidia.com>
On Wed, Mar 25, 2026 at 05:20:41PM +0200, Avihai Horon wrote:
>
> On 3/20/2026 1:12 AM, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > These two APIs are a slight duplication. For example, there're a few users
> > that directly pass in the same function.
> >
> > It might also be slightly error prone to provide two hooks, so that it's
> > easier to happen that one module report different things via the two
> > hooks. In reality they should always report the same thing, only about
> > whether we should use a fast-path when the slow path might be too slow, and
> > even if we need to pay with some less accuracy.
> >
> > Let's just merge it into one API, but instead provide a bool showing if the
> > query is a fast query or not.
> >
> > No functional change intended.
> >
> > Export qemu_savevm_query_pending(). We should likely directly use the new
> > API here provided when there're new users to do the query. This will
> > happen very soon.
> >
> > Cc: Halil Pasic <pasic@linux.ibm.com>
> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> > Cc: Jason Herne <jjherne@linux.ibm.com>
> > Cc: Eric Farman <farman@linux.ibm.com>
> > Cc: Matthew Rosato <mjrosato@linux.ibm.com>
> > Cc: Richard Henderson <richard.henderson@linaro.org>
> > Cc: Ilya Leoshkevich <iii@linux.ibm.com>
> > Cc: David Hildenbrand <david@kernel.org>
> > Cc: Cornelia Huck <cohuck@redhat.com>
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> > Cc: John Snow <jsnow@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > docs/devel/migration/main.rst | 5 ++-
> > docs/devel/migration/vfio.rst | 9 ++----
> > include/migration/register.h | 52 ++++++++++--------------------
> > migration/savevm.h | 3 ++
> > hw/s390x/s390-stattrib.c | 8 ++---
> > hw/vfio/migration.c | 58 +++++++++++++++++++---------------
> > migration/block-dirty-bitmap.c | 9 ++----
> > migration/ram.c | 32 ++++++-------------
> > migration/savevm.c | 43 ++++++++++++-------------
> > hw/vfio/trace-events | 3 +-
> > 10 files changed, 93 insertions(+), 129 deletions(-)
> >
> > diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> > index 234d280249..22c5910d5c 100644
> > --- a/docs/devel/migration/main.rst
> > +++ b/docs/devel/migration/main.rst
> > @@ -519,9 +519,8 @@ An iterative device must provide:
> > data we must save. The core migration code will use this to
> > determine when to pause the CPUs and complete the migration.
>
> We should also remove the state_pending_exact section above.
Ah yes, will do.
>
> >
> > - - A ``state_pending_estimate`` function that indicates how much more
> > - data we must save. When the estimated amount is smaller than the
> > - threshold, we call ``state_pending_exact``.
> > + - A ``save_query_pending`` function that indicates how much more
> > + data we must save.
> >
> > - A ``save_live_iterate`` function should send a chunk of data until
> > the point that stream bandwidth limits tell it to stop. Each call
> > diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
> > index 0790e5031d..33768c877c 100644
> > --- a/docs/devel/migration/vfio.rst
> > +++ b/docs/devel/migration/vfio.rst
> > @@ -50,13 +50,8 @@ VFIO implements the device hooks for the iterative approach as follows:
> > * A ``load_setup`` function that sets the VFIO device on the destination in
> > _RESUMING state.
> >
> > -* A ``state_pending_estimate`` function that reports an estimate of the
> > - remaining pre-copy data that the vendor driver has yet to save for the VFIO
> > - device.
> > -
> > -* A ``state_pending_exact`` function that reads pending_bytes from the vendor
> > - driver, which indicates the amount of data that the vendor driver has yet to
> > - save for the VFIO device.
> > +* A ``save_query_pending`` function that reports the remaining pre-copy
> > + data that the vendor driver has yet to save for the VFIO device.
> >
> > * An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> > active only when the VFIO device is in pre-copy states.
> > diff --git a/include/migration/register.h b/include/migration/register.h
> > index d0f37f5f43..2320c3a981 100644
> > --- a/include/migration/register.h
> > +++ b/include/migration/register.h
> > @@ -16,6 +16,15 @@
> >
> > #include "hw/core/vmstate-if.h"
> >
> > +typedef struct MigPendingData {
> > + /* How many bytes are pending for precopy / stopcopy? */
> > + uint64_t precopy_bytes;
> > + /* How many bytes are pending that can be transferred in postcopy? */
> > + uint64_t postcopy_bytes;
> > + /* Is this a fastpath query (which can be inaccurate)? */
> > + bool fastpath;
> > +} MigPendingData ;
>
> Extra space before semicolon.
Fixed.
>
> > +
> > /**
> > * struct SaveVMHandlers: handler structure to finely control
> > * migration of complex subsystems and devices, such as RAM, block and
> > @@ -197,46 +206,17 @@ typedef struct SaveVMHandlers {
> > bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
> >
> > /**
> > - * @state_pending_estimate
> > - *
> > - * This estimates the remaining data to transfer
> > - *
> > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > - * pending data.
> > - *
> > - * @opaque: data pointer passed to register_savevm_live()
> > - * @must_precopy: amount of data that must be migrated in precopy
> > - * or in stopped state, i.e. that must be migrated
> > - * before target start.
> > - * @can_postcopy: amount of data that can be migrated in postcopy
> > - * or in stopped state, i.e. after target start.
> > - * Some can also be migrated during precopy (RAM).
> > - * Some must be migrated after source stops
> > - * (block-dirty-bitmap)
> > - */
> > - void (*state_pending_estimate)(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy);
> > -
> > - /**
> > - * @state_pending_exact
> > - *
> > - * This calculates the exact remaining data to transfer
> > + * @save_query_pending
> > *
> > - * Sum of @can_postcopy and @must_postcopy is the whole amount of
> > - * pending data.
> > + * This estimates the remaining data to transfer on the source side.
> > + * It's highly suggested that the module should implement both fastpath
> > + * and slowpath version of it when it can be slow (for more information
> > + * please check pending->fastpath field).
> > *
> > * @opaque: data pointer passed to register_savevm_live()
> > - * @must_precopy: amount of data that must be migrated in precopy
> > - * or in stopped state, i.e. that must be migrated
> > - * before target start.
> > - * @can_postcopy: amount of data that can be migrated in postcopy
> > - * or in stopped state, i.e. after target start.
> > - * Some can also be migrated during precopy (RAM).
> > - * Some must be migrated after source stops
> > - * (block-dirty-bitmap)
> > + * @pending: pointer to a MigPendingData struct
> > */
> > - void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy);
> > + void (*save_query_pending)(void *opaque, MigPendingData *pending);
> >
> > /**
> > * @load_state
> > diff --git a/migration/savevm.h b/migration/savevm.h
> > index b3d1e8a13c..b116933bce 100644
> > --- a/migration/savevm.h
> > +++ b/migration/savevm.h
> > @@ -14,6 +14,8 @@
> > #ifndef MIGRATION_SAVEVM_H
> > #define MIGRATION_SAVEVM_H
> >
> > +#include "migration/register.h"
> > +
> > #define QEMU_VM_FILE_MAGIC 0x5145564d
> > #define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
> > #define QEMU_VM_FILE_VERSION 0x00000003
> > @@ -43,6 +45,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> > void qemu_savevm_state_cleanup(void);
> > void qemu_savevm_state_complete_postcopy(QEMUFile *f);
> > int qemu_savevm_state_complete_precopy(MigrationState *s);
> > +void qemu_savevm_query_pending(MigPendingData *pending, bool fastpath);
> > void qemu_savevm_state_pending_exact(uint64_t *must_precopy,
> > uint64_t *can_postcopy);
> > void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> > diff --git a/hw/s390x/s390-stattrib.c b/hw/s390x/s390-stattrib.c
> > index d808ece3b9..b1ec51c77a 100644
> > --- a/hw/s390x/s390-stattrib.c
> > +++ b/hw/s390x/s390-stattrib.c
> > @@ -187,15 +187,14 @@ static int cmma_save_setup(QEMUFile *f, void *opaque, Error **errp)
> > return 0;
> > }
> >
> > -static void cmma_state_pending(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void cmma_state_pending(void *opaque, MigPendingData *pending)
> > {
> > S390StAttribState *sas = S390_STATTRIB(opaque);
> > S390StAttribClass *sac = S390_STATTRIB_GET_CLASS(sas);
> > long long res = sac->get_dirtycount(sas);
> >
> > if (res >= 0) {
> > - *must_precopy += res;
> > + pending->precopy_bytes += res;
> > }
> > }
> >
> > @@ -340,8 +339,7 @@ static SaveVMHandlers savevm_s390_stattrib_handlers = {
> > .save_setup = cmma_save_setup,
> > .save_live_iterate = cmma_save_iterate,
> > .save_complete = cmma_save_complete,
> > - .state_pending_exact = cmma_state_pending,
> > - .state_pending_estimate = cmma_state_pending,
> > + .save_query_pending = cmma_state_pending,
> > .save_cleanup = cmma_save_cleanup,
> > .load_state = cmma_load,
> > .is_active = cmma_active,
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 827d3ded63..c054c749b0 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -570,42 +570,51 @@ static void vfio_save_cleanup(void *opaque)
> > trace_vfio_save_cleanup(vbasedev->name);
> > }
> >
> > -static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> > - uint64_t *can_postcopy)
> > +static void vfio_state_pending_sync(VFIODevice *vbasedev)
> > {
> > - VFIODevice *vbasedev = opaque;
> > VFIOMigration *migration = vbasedev->migration;
> >
> > - if (!vfio_device_state_is_precopy(vbasedev)) {
> > - return;
> > - }
> > + vfio_query_stop_copy_size(vbasedev);
> >
> > - *must_precopy +=
> > - migration->precopy_init_size + migration->precopy_dirty_size;
> > + if (vfio_device_state_is_precopy(vbasedev)) {
> > + vfio_query_precopy_size(migration);
> > + }
> >
> > - trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> > - *can_postcopy,
> > - migration->precopy_init_size,
> > - migration->precopy_dirty_size);
> > + /*
> > + * In all cases, all PRECOPY data should be no more than STOPCOPY data.
> > + * Otherwise we have a problem. So far, only dump some errors.
> > + */
>
> I'm not sure that must be true, as stopcopy and precopy queries are not
> atomic, so device state size can change between the queries.
I believe I wrote this only based on my (proven wrong) understanding of
stopcopy size, which I thought was an upper bound. It looks not.. If so,
yes, atomicity will be a problem..
Is below all the doc I can find regarding to this ioctl? vfio.h has:
/*
* Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
* be required to complete stop copy.
*
* Note: Can be called on each device state.
*/
struct vfio_device_feature_mig_data_size {
__aligned_u64 stop_copy_length;
};
#define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE 9
I understand all driver authors want to be flexible, but I wonder if I'm
the only person keep getting confused by the VFIO migration API on
providing estimated results all over the places, and it'll be definitely
more confusing if it can shrink. We can leave this part of discussion in
the other thread.
I don't think this check is a must; I can simply remove it, however please
help me to understand better on this interface.
Another note is that the goal of this series is to provide any way so the
mgmt can query pending information when VFIO is involved during migration.
This series only exposed (hopefully, much more accurate) expected_downtime.
When repost, I plan to add a system-wise remaining data field so the mgmt
can fetch for each of the iterations to predict any possible unconvergence,
for example, query remaining data and if it stops shrinking for a while it
is a simple sign of possible not converging. It means either downtime
needs adjustments or something else (cancellation is also an option).
Avihai, if you think this goal will also be part of what you or your team
will be looking for, please let me know. I am always open if you or
someone from your team thinks it easier to take this over as you know
better on this area. Otherwise I'll respin after I figure out the rest
puzzles with your help.
Thanks,
--
Peter Xu
next prev parent reply other threads:[~2026-04-01 21:22 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
2026-03-20 12:26 ` Prasad Pandit
2026-03-27 14:35 ` Juraj Marcin
2026-03-30 11:52 ` Prasad Pandit
2026-03-31 12:49 ` Juraj Marcin
2026-04-06 7:21 ` Prasad Pandit
2026-04-01 19:11 ` Peter Xu
2026-03-27 15:05 ` Juraj Marcin
2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
2026-03-19 23:26 ` Peter Xu
2026-03-20 6:54 ` Markus Armbruster
2026-04-01 19:38 ` Peter Xu
2026-04-01 19:47 ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
2026-03-25 14:10 ` Avihai Horon
2026-04-01 20:36 ` Peter Xu
2026-04-06 11:21 ` Avihai Horon
2026-04-07 15:18 ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
2026-03-25 14:15 ` Avihai Horon
2026-04-01 20:41 ` Peter Xu
2026-04-06 11:28 ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
2026-03-24 10:35 ` Prasad Pandit
2026-04-01 20:53 ` Peter Xu
2026-03-25 15:20 ` Avihai Horon
2026-04-01 21:22 ` Peter Xu [this message]
2026-04-06 11:54 ` Avihai Horon
2026-03-27 15:17 ` Juraj Marcin
2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
2026-03-24 9:35 ` Prasad Pandit
2026-03-27 15:24 ` Juraj Marcin
2026-04-01 22:28 ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
2026-03-24 11:05 ` Prasad Pandit
2026-03-25 16:54 ` Avihai Horon
2026-04-02 14:09 ` Peter Xu
2026-04-06 12:20 ` Avihai Horon
2026-04-07 15:30 ` Peter Xu
2026-03-27 16:43 ` Juraj Marcin
2026-04-02 15:16 ` Peter Xu
2026-04-07 15:19 ` Juraj Marcin
2026-04-07 15:32 ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
2026-03-25 17:32 ` Avihai Horon
2026-04-02 15:28 ` Peter Xu
2026-04-02 15:55 ` Peter Xu
2026-04-06 12:34 ` Avihai Horon
2026-04-07 15:45 ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
2026-03-20 6:12 ` Yong Huang
2026-03-20 9:49 ` Prasad Pandit
2026-04-02 15:35 ` Peter Xu
2026-03-27 16:49 ` Juraj Marcin
2026-04-02 15:42 ` Peter Xu
2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
2026-03-23 10:26 ` Prasad Pandit
2026-03-27 17:07 ` Juraj Marcin
2026-04-07 17:27 ` Peter Xu
2026-04-08 14:33 ` Juraj Marcin
2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
2026-03-27 17:17 ` Juraj Marcin
2026-04-07 17:33 ` Peter Xu
2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
2026-03-23 12:05 ` Prasad Pandit
2026-04-07 17:40 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ac2MitFie3jXyiRk@x1.local \
--to=peterx@redhat.com \
--cc=alex@shazbot.org \
--cc=armbru@redhat.com \
--cc=avihaih@nvidia.com \
--cc=berrange@redhat.com \
--cc=borntraeger@linux.ibm.com \
--cc=clg@redhat.com \
--cc=cohuck@redhat.com \
--cc=david@kernel.org \
--cc=eblake@redhat.com \
--cc=farman@linux.ibm.com \
--cc=farosas@suse.de \
--cc=iii@linux.ibm.com \
--cc=jjherne@linux.ibm.com \
--cc=jmarcin@redhat.com \
--cc=joao.m.martins@oracle.com \
--cc=jsnow@redhat.com \
--cc=kwankhede@nvidia.com \
--cc=mail@maciej.szmigiero.name \
--cc=mjrosato@linux.ibm.com \
--cc=pasic@linux.ibm.com \
--cc=prtyagi@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=richard.henderson@linaro.org \
--cc=vsementsov@yandex-team.ru \
--cc=yishaih@nvidia.com \
--cc=zhguo@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.