Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation

From: Juraj Marcin <jmarcin@redhat.com>
To: Prasad Pandit <ppandit@redhat.com>
Cc: "Peter Xu" <peterx@redhat.com>,
	qemu-devel@nongnu.org, "Kirti Wankhede" <kwankhede@nvidia.com>,
	"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
	"Daniel P . Berrangé" <berrange@redhat.com>,
	"Joao Martins" <joao.m.martins@oracle.com>,
	"Alex Williamson" <alex@shazbot.org>,
	"Yishai Hadas" <yishaih@nvidia.com>,
	"Fabiano Rosas" <farosas@suse.de>,
	"Pranav Tyagi" <prtyagi@redhat.com>,
	"Zhiyi Guo" <zhguo@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Avihai Horon" <avihaih@nvidia.com>,
	"Cédric Le Goater" <clg@redhat.com>,
	qemu-stable@nongnu.org
Subject: Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
Date: Tue, 31 Mar 2026 14:49:48 +0200
Message-ID: <acuQc3dVq5vfh3cl@fedora>
In-Reply-To: <CAE8KmOwwFmrNgb8MWO7yprPmtautrbrWzF5FYdJ4ODboQxhwUw@mail.gmail.com>

Hi Prasad,

On 2026-03-30 17:22, Prasad Pandit wrote:
> Hello Juraj,
> 
> On Fri, 27 Mar 2026 at 20:05, Juraj Marcin <jmarcin@redhat.com> wrote:
> > > * What is the 'size' difference between < s->threshold_size  Vs  <=
> > > s->threshold_size?  Going through the source IIUC
> > > 1) 'pending_size' is measured in Bytes.
> > >      static void ram_state_pending_exact/_estimate()
> > >          remaining_size = rs->migration_dirty_pages *
> > > TARGET_PAGE_SIZE(=4096 bytes);
> > >          100 dirty pages * 4096bytes  => 409600 dirty bytes => 409600
> > > * 8 => 3,276,800 dirty bits
> > >
> > > 2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
> > > downtime(=300 ms)
> > >         100,000,000 bits/s => 100,000 bits/ms
> > >         100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
> > >         30,000,000 bits / 8  =>  3,750,000 Bytes / 300 ms
> > >         s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
> > > transferred in 300ms downtime.
> > >
> > > * Are we comparing pending_size(=409600 bytes)  <=
> > > s->threshold_size(=30,000,000 bits)?
> >
> > While threshold_size is indeed derived from bandwidth, bandwidth is in
> > bytes:
> >
> >     current_bytes = migration_transferred_bytes();
> >     transferred = current_bytes - s->iteration_initial_bytes;
> >     time_spent = current_time - s->iteration_start_time;
> >     bandwidth = (double)transferred / time_spent;
> >
> > Conversion to bits only happens for the mbps statistic:
> >
> >     s->mbps = (((double) transferred * 8.0) /
> >                ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
> >
> > >
> > > *  static void migration_update_counters()
> > >         transferred = current_bytes - s->iteration_initial_bytes;
> > >         bandwidth = (double)transferred / time_spent
> > >         if (switchover_bw) {
> > >             expected_bw_per_ms = (double)switchover_bw / 1000;
> > >         } else {
> > >             expected_bw_per_ms = bandwidth;
> > >         }
> > > => ^^^^^^^  Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?
> >
> > switchover_bw is expected to be in bytes/sec, however, time_spent is
> > already in msec, thus bandwidth is also bytes/msec, the existing code is
> > correct.
> 
> * I see, this is not readily clear though. This needs good
> improvement. Maybe we should add an explanation in comments OR include
> an example calculation OR define helper function(s) to convert
> bandwidth from Mb/s <-> Mb/ms and MBps <-> Mbps etc.
> 
> >     s->mbps = (((double) transferred * 8.0) /
> >                ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
> 
> * If time_spent is in msec, continuing the above 100 mbps example, we
> got to 3,750,000 Bytes / 300 ms above. Now to get mbps
> 
>      s->mbps = (3,750,000 * 8.0)  / 300 (time_spent) bits/ms  =>
> 30,000,000 bits / 300ms
>      s->mbps = 30,000,000 bits * 1000ms (=1sec) / 300ms  =>  100,000,000 bits/s (= 100 mbps).
> 
> *  (time_spent / 1000.0) / 1000.0 / 1000.0  is not very apparent.
> 
> > @Peter, not sure if it is necessary, but it could be useful to mention
> > in MigrationParameters docs, that avail-switchover-bandwidth is in
> > bytes, not bits?
> 
> * I request that we use 'Mbps' notation for bandwidth at every user
> interface, be it --bandwidth <option> OR a definition in a
> configuration file OR documentation. Users should always specify and
> read bandwidth in 'Mbps'. Because that is the notation used
> everywhere. Asking users to specify 100,000,000 bps / 8 => 12,500,000
> Bytes/s  instead of 100Mbps does not seem right, is not user friendly.

The QAPI schema defines the type of 'avail-switchover-bandwidth' and the
other '*-bandwidth' properties as 'size', which allows users to use a
size suffix, for example '100M'. The QAPI parser then parses this value
to 100 * 1000 * 1000 bytes.

Changing from bytes to bits would be an API-breaking change, and I don't
think it is justified.

> 
> > >
> > >       s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();
> > >
> > > migration_iteration_run():
> > >    /* Should we switch to postcopy now? */
> > >    if (must_precopy <= s->threshold_size &&
> > >       can_switchover && qatomic_read(&s->start_postcopy)) {
> > >       if (postcopy_start(s, &local_err)) {
> > >           migrate_error_propagate(s, error_copy(local_err));
> > >           error_report_err(local_err);
> > >       }
> > >       return MIG_ITERATE_SKIP;
> > >    }
> > > * Here we should check pending_size <= s->threshold_size,  because
> > > must_precopy is zero(0) when postcopy is enabled. And we switch to
> > > postcopy mode even when pending_size > s->threshold_size.
> > >   I wonder if we really need both 'must_precopy' and 'can_postcopy'
> > > variables, they seem to complicate things.
> >
> > With devices that implement pending method, don't support postcopy, and
> > are not yet migrated, must_precopy would not be zero.
> 
> * If must_precopy is not zero(0), then pending_data would not be <
> threshold_size either, right? What is must_precopy data? IIUC, all
> device state is 'must_precopy' data, because we don't send Postcopy
> requests for device state data. And can_postcopy data is only RAM
> pages.
> 
> > Both, must_precopy and can_postcopy are required, that is what allows
> > postcopy to switchover early. pending_size is the overall total that
> > includes also postcopiable data, hence why it is only used to trigger
> > precopy completion.
> >
> > However, the majority of devices don't implement pending methods (yet)
> > and thus are not counted towards the estimate even if they don't support
> > postcopy and affect the downtime.
> >
> > Wondering if VMSD devices could implement some pending estimates based
> > on their defined fields, this would also improve not violating the
> > downtime requirements.
> 
> * Using 'pending_size <= threshold_size'  in one place and using
> 'must_precopy <= threshold_size' in another place is confusing and
> inconsistent.
> 
> * What is threshold_size really? Number of bits/bytes we _can_
> transfer within the given downtime, right? if downtime is 300ms, at
> 100 mbps, threshold_size is 30Mb (ie. 10Mb/100ms). For 500ms,
> threshold_size is 50Mb.

Yes, that is right.
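
To illustrate the units, here is a minimal standalone sketch of the
arithmetic discussed above. It is not the QEMU code itself: the helper
names are hypothetical, and it only assumes what was stated earlier in
this thread (time_spent in msec, switchover_bw in bytes/sec, and the
threshold in bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: measured bandwidth in bytes per millisecond. */
static double measured_bw_per_ms(uint64_t transferred_bytes,
                                 uint64_t time_spent_ms)
{
    assert(time_spent_ms > 0);
    return (double)transferred_bytes / (double)time_spent_ms;
}

/*
 * Hypothetical helper: bandwidth expected during switchover, in bytes/ms.
 * A non-zero switchover_bw (bytes/sec) overrides the measured value and
 * is the only input that needs the division by 1000.
 */
static double expected_bw_per_ms(uint64_t switchover_bw, double measured)
{
    return switchover_bw ? (double)switchover_bw / 1000.0 : measured;
}

/* Bytes that can be transferred within the downtime limit (in msec). */
static double threshold_size(double bw_per_ms, uint64_t downtime_limit_ms)
{
    return bw_per_ms * (double)downtime_limit_ms;
}

/* Megabits per second; bits appear only in this statistic. */
static double mbps(uint64_t transferred_bytes, uint64_t time_spent_ms)
{
    assert(time_spent_ms > 0);
    return (((double)transferred_bytes * 8.0) /
            ((double)time_spent_ms / 1000.0)) / 1000.0 / 1000.0;
}
```

With the numbers from the example above, 3,750,000 bytes transferred in
300 ms gives 12,500 bytes/ms, a threshold of 3,750,000 bytes for a 300 ms
downtime limit, and roughly 100 mbps for the statistic.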

> 
> * Then it is easy to understand that when pending_size is <=
> threshold_size, then we can easily pause the source VM and switch to
> Postcopy, because those pending_size bits/bytes can be transferred in
> the given downtime. And if pending_size is more, then we _can not_
> pause the source VM, because we can not transfer more pending data
> within the given downtime value.

Devices report two pending values, 'must_precopy' and 'can_postcopy'.
'must_precopy' is the number of bytes that must be migrated in precopy
(or during switchover) and cannot be postcopied. 'can_postcopy' is the
number of bytes that can be migrated in either precopy or postcopy;
these can be RAM pages (with postcopy-ram enabled), dirty bitmaps, or
data of any device that supports postcopy.

If switching to postcopy is allowed, we only need to be able to transfer
the 'must_precopy' bytes during switchover. The rest of the pending data
(the 'can_postcopy' bytes) can be migrated in postcopy. Thus, we only
compare 'must_precopy' when deciding whether to switch to postcopy.

However, if postcopy isn't allowed and we are deciding whether to stop
the source machine and complete the migration, we need to migrate
everything during switchover, including the data that could otherwise be
postcopied. Thus, we need to compare
'pending_size = must_precopy + can_postcopy' to decide whether we can
complete the migration within the specified downtime.
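
The two comparisons can be put side by side in a simplified sketch. This
is only an illustration, not the code in migration_iteration_run(): the
enum, the function name and the order of the checks are assumptions, and
it leaves out details such as can_switchover and error handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type, for illustration only. */
typedef enum {
    KEEP_ITERATING,
    START_POSTCOPY,
    COMPLETE_PRECOPY,
} SwitchoverDecision;

/*
 * Hypothetical helper. All sizes are in bytes; postcopy_requested stands
 * in for "postcopy is enabled and the user asked to start it".
 */
static SwitchoverDecision decide_switchover(uint64_t must_precopy,
                                            uint64_t can_postcopy,
                                            uint64_t threshold_size,
                                            bool postcopy_requested)
{
    uint64_t pending_size = must_precopy + can_postcopy;

    /* Only the non-postcopiable part has to fit into the downtime. */
    if (postcopy_requested && must_precopy <= threshold_size) {
        return START_POSTCOPY;
    }

    /* Without postcopy, everything pending has to fit into the downtime. */
    if (pending_size <= threshold_size) {
        return COMPLETE_PRECOPY;
    }

    return KEEP_ITERATING;
}
```

For example, with must_precopy = 1000, can_postcopy = 5000 and
threshold_size = 3000, the sketch starts postcopy if it was requested and
keeps iterating otherwise, because the total of 6000 bytes does not fit
into the downtime.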

> 
> 
> Thank you.
> ---
>   - Prasad
>
