Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Juraj Marcin <jmarcin@redhat.com>
To: Prasad Pandit <ppandit@redhat.com>, Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, "Kirti Wankhede" <kwankhede@nvidia.com>,
	"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
	"Daniel P . Berrangé" <berrange@redhat.com>,
	"Joao Martins" <joao.m.martins@oracle.com>,
	"Alex Williamson" <alex@shazbot.org>,
	"Yishai Hadas" <yishaih@nvidia.com>,
	"Fabiano Rosas" <farosas@suse.de>,
	"Pranav Tyagi" <prtyagi@redhat.com>,
	"Zhiyi Guo" <zhguo@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Avihai Horon" <avihaih@nvidia.com>,
	"Cédric Le Goater" <clg@redhat.com>,
	qemu-stable@nongnu.org
Subject: Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
Date: Fri, 27 Mar 2026 15:35:35 +0100	[thread overview]
Message-ID: <acaNfFWvNAEpFsFD@fedora> (raw)
In-Reply-To: <CAE8KmOyTq4wib3niowjXNfqh4TjwhcEDqUt99=mvuLs5vsSx9A@mail.gmail.com>

Hi Prasad,

On 2026-03-20 17:56, Prasad Pandit wrote:
> On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> > When QEMU queried the estimated version of pending data and thinks it's
> > ready to converge, it'll send another accurate query to make sure of it.
> > It is needed to make sure we collect the latest reports and that equation
> > still holds true.
> >
> > However we missed one tiny little difference here on "<" v.s. "<=" when
> > comparing pending_size (A) to threshold_size (B)..
> >
> > QEMU src only re-query if A<B, but will kickoff switchover if A<=B.
> >
> > I think it means it is possible to happen if A (as an estimate only so far)
> > accidentally equals to B, then re-query won't happen and switchover will
> > proceed without considering new dirtied data.
> >
> > It turns out it was an accident in my commit 7aaa1fc072 when refactoring
> > the code around.  Fix this by using the same equation in both places.
> >
> > Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
> > Cc: qemu-stable@nongnu.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/migration.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58..dfc60372cf 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> >           * postcopy started, so ESTIMATE should always match with EXACT
> >           * during postcopy phase.
> >           */
> > -        if (pending_size < s->threshold_size) {
> > +        if (pending_size <= s->threshold_size) {
> >              qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
> >              pending_size = must_precopy + can_postcopy;
> >              trace_migrate_pending_exact(pending_size, must_precopy,
> 
> * What is the 'size' difference between < s->threshold_size  Vs  <=
> s->threshold_size?  Going through the source IIUC
> 1) 'pending_size' is measured in Bytes.
>      static void ram_state_pending_exact/_estimate()
>          remaining_size = rs->migration_dirty_pages *
> TARGET_PAGE_SIZE(=4096 bytes);
>          100 dirty pages * 4096bytes  => 409600 dirty bytes => 409600
> * 8 => 3,276,800 dirty bits
> 
> 2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
> downtime(=300 ms)
>         100,000,000 bits/s => 100,000 bits/ms
>         100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
>         30,000,000 bits / 8  =>  3,750,000 Bytes / 300 ms
>         s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
> transferred in 300ms downtime.
> 
> * Are we comparing pending_size(=409600 bytes)  <=
> s->threshold_size(=30,000,000 bits)?

While threshold_size is indeed derived from bandwidth, bandwidth is in
bytes:

    current_bytes = migration_transferred_bytes();
    transferred = current_bytes - s->iteration_initial_bytes;
    time_spent = current_time - s->iteration_start_time;
    bandwidth = (double)transferred / time_spent;

Conversion to bits only happens for the mbps statistic:

    s->mbps = (((double) transferred * 8.0) /
               ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;

> 
> *  static void migration_update_counters()
>         transferred = current_bytes - s->iteration_initial_bytes;
>         bandwidth = (double)transferred / time_spent
>         if (switchover_bw) {
>             expected_bw_per_ms = (double)switchover_bw / 1000;
>         } else {
>             expected_bw_per_ms = bandwidth;
>         }
> => ^^^^^^^  Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?

switchover_bw is expected to be in bytes/sec, however, time_spent is
already in msec, thus bandwidth is also bytes/msec, the existing code is
correct.

@Peter, not sure if it is necessary, but it could be usefull to mention
in MigrationParameters docs, that avail-switchover-bandwidth is in
bytes, not bits?

> 
>       s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> 
> migration_iteration_run():
>    /* Should we switch to postcopy now? */
>    if (must_precopy <= s->threshold_size &&
>       can_switchover && qatomic_read(&s->start_postcopy)) {
>       if (postcopy_start(s, &local_err)) {
>           migrate_error_propagate(s, error_copy(local_err));
>           error_report_err(local_err);
>       }
>       return MIG_ITERATE_SKIP;
>    }
> * Here we should check pending_size <= s->threshold_size,  because
> must_precopy is zero(0) when postcopy is enabled. And we switch to
> postcopy mode even when pending_size > s->threshold_size.
>   I wonder if we really need both 'must_precopy' and 'can_postcopy'
> variables, they seem to complicate things.

With devices that implement pending method, don't support postcopy, and
are not yet migrated, must_precopy would not be zero.

Both, must_precopy and can_postcopy are required, that is what allows
postcopy to switchover early. pending_size is the overall total that
includes also postcopiable data, hence why it is only used to trigger
precopy completion.

However, the majority of devices don't implement pending methods (yet)
and thus are not counted towards the estimate even if they don't support
postcopy and affect the downtime.

Wondering if VMSD devices could implement some pending estimates based
on their defined fields, this would also improve not violating the
downtime requirements.

> ===
> # virsh migrate --verbose --live --auto-converge --postcopy
> --postcopy-after-precopy f42vm
> qemu+ssh://destination-machine.com/system
> # less /var/log/libvirt/qemu/f42vm.log
> ...
> migration_iteration_run: estimated pending_size: 50577408 bytes,
> s->threshold_size: 36282361
> migration_iteration_run: estimated pending_size: 43757568 bytes,
> s->threshold_size: 36282361
> migration_iteration_run: estimated pending_size: 36413440 bytes,
> s->threshold_size: 34334680
> migration_iteration_run: estimated pending_size: 29069312 bytes,
> s->threshold_size: 34334680
> 
> migration_iteration_run: exact pending_size: 4339167232 bytes, 0,
> 4339167232              <== exact size is calculated once.
> migration_iteration_run: estimated pending_size: 4332871680 bytes,
> s->threshold_size: 35651363
> migration_iteration_run: switching to postcopy: 4332871680, 0,
> 4332871680                    <== switch to postcopy with
> must_precopy(=0) <= s->threshold_size
> 
> migration_iteration_run: estimated pending_size: 4332892160 bytes,
> s->threshold_size: 35651363
> migration_iteration_run: estimated pending_size: 4323188736 bytes,
> s->threshold_size: 27243109
> migration_iteration_run: estimated pending_size: 4315320320 bytes,
> s->threshold_size: 27243109
> migration_iteration_run: estimated pending_size: 4308221952 bytes,
> s->threshold_size: 37695433
> ===
> * Here, the exact pending_size is calculated only once, because we
> switch to Postcopy mode even when pending_size is > s->threshold_size.
> 
> Thank you.
> ---
>   - Prasad
>

next prev parent reply	other threads:[~2026-03-27 14:37 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-19 23:12 [PATCH RFC 00/12] migration/vfio: Fix a few issues on API misuse or statistic reports Peter Xu
2026-03-19 23:12 ` [PATCH RFC 01/12] migration: Fix low possibility downtime violation Peter Xu
2026-03-20 12:26   ` Prasad Pandit
2026-03-27 14:35     ` Juraj Marcin [this message]
2026-03-30 11:52       ` Prasad Pandit
2026-03-31 12:49         ` Juraj Marcin
2026-04-06  7:21           ` Prasad Pandit
2026-04-01 19:11       ` Peter Xu
2026-03-27 15:05   ` Juraj Marcin
2026-03-19 23:12 ` [PATCH RFC 02/12] migration/qapi: Rename MigrationStats to MigrationRAMStats Peter Xu
2026-03-19 23:26   ` Peter Xu
2026-03-20  6:54   ` Markus Armbruster
2026-04-01 19:38     ` Peter Xu
2026-04-01 19:47     ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 03/12] vfio/migration: Throttle vfio_save_block() on data size to read Peter Xu
2026-03-25 14:10   ` Avihai Horon
2026-04-01 20:36     ` Peter Xu
2026-04-06 11:21       ` Avihai Horon
2026-04-07 15:18         ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 04/12] vfio/migration: Cache stop size in VFIOMigration Peter Xu
2026-03-25 14:15   ` Avihai Horon
2026-04-01 20:41     ` Peter Xu
2026-04-06 11:28       ` Avihai Horon
2026-03-19 23:12 ` [PATCH RFC 05/12] migration/treewide: Merge @state_pending_{exact|estimate} APIs Peter Xu
2026-03-24 10:35   ` Prasad Pandit
2026-04-01 20:53     ` Peter Xu
2026-03-25 15:20   ` Avihai Horon
2026-04-01 21:22     ` Peter Xu
2026-04-06 11:54       ` Avihai Horon
2026-03-27 15:17   ` Juraj Marcin
2026-03-19 23:12 ` [PATCH RFC 06/12] migration: Use the new save_query_pending() API directly Peter Xu
2026-03-24  9:35   ` Prasad Pandit
2026-03-27 15:24   ` Juraj Marcin
2026-04-01 22:28     ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 07/12] migration: Introduce stopcopy_bytes in save_query_pending() Peter Xu
2026-03-24 11:05   ` Prasad Pandit
2026-03-25 16:54   ` Avihai Horon
2026-04-02 14:09     ` Peter Xu
2026-04-06 12:20       ` Avihai Horon
2026-04-07 15:30         ` Peter Xu
2026-03-27 16:43   ` Juraj Marcin
2026-04-02 15:16     ` Peter Xu
2026-04-07 15:19       ` Juraj Marcin
2026-04-07 15:32         ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 08/12] vfio/migration: Fix incorrect reporting for VFIO pending data Peter Xu
2026-03-25 17:32   ` Avihai Horon
2026-04-02 15:28     ` Peter Xu
2026-04-02 15:55       ` Peter Xu
2026-04-06 12:34         ` Avihai Horon
2026-04-07 15:45           ` Peter Xu
2026-03-19 23:12 ` [PATCH RFC 09/12] migration: Make iteration counter out of RAM Peter Xu
2026-03-20  6:12   ` Yong Huang
2026-03-20  9:49   ` Prasad Pandit
2026-04-02 15:35     ` Peter Xu
2026-03-27 16:49   ` Juraj Marcin
2026-04-02 15:42     ` Peter Xu
2026-03-19 23:13 ` [PATCH RFC 10/12] migration: Introduce a helper to return switchover bw estimate Peter Xu
2026-03-23 10:26   ` Prasad Pandit
2026-03-27 17:07   ` Juraj Marcin
2026-04-07 17:27     ` Peter Xu
2026-04-08 14:33       ` Juraj Marcin
2026-03-19 23:13 ` [PATCH RFC 11/12] migration: Calculate expected downtime on demand Peter Xu
2026-03-27 17:17   ` Juraj Marcin
2026-04-07 17:33     ` Peter Xu
2026-03-19 23:13 ` [PATCH RFC 12/12] migration: Fix calculation of expected_downtime to take VFIO info Peter Xu
2026-03-23 12:05   ` Prasad Pandit
2026-04-07 17:40     ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=acaNfFWvNAEpFsFD@fedora \
    --to=jmarcin@redhat.com \
    --cc=alex@shazbot.org \
    --cc=armbru@redhat.com \
    --cc=avihaih@nvidia.com \
    --cc=berrange@redhat.com \
    --cc=clg@redhat.com \
    --cc=farosas@suse.de \
    --cc=joao.m.martins@oracle.com \
    --cc=kwankhede@nvidia.com \
    --cc=mail@maciej.szmigiero.name \
    --cc=peterx@redhat.com \
    --cc=ppandit@redhat.com \
    --cc=prtyagi@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-stable@nongnu.org \
    --cc=yishaih@nvidia.com \
    --cc=zhguo@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.