From: Peter Xu <peterx@redhat.com>
To: Chuang Xu <xuchuangxclwt@bytedance.com>
Cc: qemu-devel@nongnu.org, mst@redhat.com, sgarzare@redhat.com,
richard.henderson@linaro.org, pbonzini@redhat.com,
david@kernel.org, philmd@linaro.org, farosas@suse.de
Subject: Re: [PATCH v2 0/1] migration: reduce bitmap sync time and make dirty pages converge much more easily
Date: Mon, 15 Dec 2025 11:26:13 -0500
Message-ID: <aUA2pYf68psZazPu@x1.local>
In-Reply-To: <20251215140611.16180-1-xuchuangxclwt@bytedance.com>
On Mon, Dec 15, 2025 at 10:06:10PM +0800, Chuang Xu wrote:
> In this version:
>
> - drop duplicate vhost_log_sync optimization
> - refactor physical_memory_test_and_clear_dirty
> - provide a more detailed breakdown of bitmap sync time in this cover letter
>
>
> In our long-term experience at Bytedance, we've found that under the same load,
> live migration of larger VMs with more devices often has more difficulty
> converging (i.e. it requires a larger downtime limit).
>
> We've observed that the live migration bandwidth qemu calculates for large,
> multi-device VMs is severely distorted, likely the same problem described at
> https://wiki.qemu.org/ToDo/LiveMigration#Optimize_migration_bandwidth_calculation.
>
> Through testing and calculation, we conclude that bitmap sync time distorts
> qemu's live migration bandwidth calculation.
>
> Below, we use some formulaic reasoning to derive the relationship between the
> bitmap sync time and the downtime limit required to reach the stop condition.
>
> Assume the actual live migration bandwidth is B, the dirty page rate is D,
> the bitmap sync time is x (ms), the transfer time per iteration is t (ms), and the
> downtime limit is y (ms).
>
> To simplify the calculation, we assume that none of the dirty pages are zero
> pages and only consider the case B > D.
>
> When x + t > 100ms, the bandwidth calculated by qemu is R = B * t / (x + t).
> When x + t < 100ms, the bandwidth calculated by qemu is R = B * (100 - x) / 100.
>
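> To make the model concrete, here is a minimal C sketch of the two estimation
> cases above (an illustration of the model only, not the actual migration.c
> code; WINDOW_MS stands in for the 100ms threshold used in the formulas):
>
>     /* Sampling window from the formulas above, in ms. */
>     #define WINDOW_MS 100.0
>
>     /* Estimated bandwidth R given actual bandwidth B, bitmap sync
>      * time x (ms) and per-iteration transfer time t (ms). */
>     static double estimated_bw(double B, double x, double t)
>     {
>         if (x + t > WINDOW_MS) {
>             /* Long iteration: the window is x + t, but data only
>              * flows during t, so R = B * t / (x + t). */
>             return B * t / (x + t);
>         }
>         /* Short iteration: the window is padded to 100ms, of which
>          * only the sync time x is idle, so R = B * (100 - x) / 100. */
>         return B * (WINDOW_MS - x) / WINDOW_MS;
>     }
>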
> At the critical convergence state, the data sent per iteration exactly matches
> the data dirtied during it, so we have:
> (1) B * t = D * (x + t)
> (2) t = D * x / (B - D)
> For the stop condition to be met, we have two cases.
> When:
> (3) x + t > 100
> (4) x + D * x / (B - D) > 100
> (5) x > 100 - 100 * D / B
> Then:
> (6) R * y > D * (x + t)
> (7) B * t * y / (x + t) > D * (x + t)
> (8) (B * (D * x / (B - D)) * y) / (x + D * x / (B - D)) > D * (x + D * x / (B - D))
> (9) D * y > D * (x + D * x / (B - D))
> (10) y > x + D * x / (B - D)
> (11) (B - D) * y > B * x
> (12) y > B * x / (B - D)
>
> When:
> (13) x + t < 100
> (14) x + D * x / (B - D) < 100
> (15) x < 100 - 100 * D / B
> Then:
> (16) R * y > D * (x + t)
> (17) B * (100 - x) * y / 100 > D * (x + t)
> (18) B * (100 - x) * y / 100 > D * (x + D * x / (B - D))
> (19) y > 100 * D * x / ((B - D) * (100 - x))
>
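> As a quick sanity check, both bounds can be evaluated directly (a hypothetical
> helper; B and D share a unit, x is in ms, and B > D is assumed):
>
>     /* Minimum downtime limit y (ms) needed for convergence, from
>      * formulas (2), (12) and (19). */
>     static double min_downtime_ms(double B, double D, double x)
>     {
>         double t = D * x / (B - D);               /* formula (2) */
>
>         if (x + t > 100.0) {
>             return B * x / (B - D);               /* formula (12) */
>         }
>         return 100.0 * D * x
>                / ((B - D) * (100.0 - x));         /* formula (19) */
>     }
>
> Plugging in the numbers used below (B = 15, D = 10, x = 73 or 12) reproduces
> the 219ms and 27.3ms bounds.
>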
> With the formulas derived, we can plug in some real data for comparison.
>
> For a 64C256G VM with 8 vhost-user-net devices (32 queues per NIC) and 16
> vhost-user-blk devices (4 queues per disk), the sync time is as high as *73ms*
> (tested at a 10GBps dirty rate; the sync time increases as the dirty page rate
> increases). The sync time breaks down as follows:
>
> - sync from kvm to ram_list: 2.5ms
> - vhost_log_sync: 3ms
> - sync aligned memory from ram_list to RAMBlock: 5ms
> - sync misaligned memory from ram_list to RAMBlock: 61ms
>
> After applying this patch, syncing misaligned memory from ram_list to RAMBlock takes only about 1ms,
> and the total sync time is only *12ms*.

These numbers are greatly helpful, thanks a lot. Please put them into the
commit message of the patch.

OTOH, IMHO you can drop the formula and bandwidth calculation complexities.
Your numbers here already show that this patch is very useful.

I could have amended the commit message myself when queuing, but there's a
code change I want to double check with you. I'll reply there soon.

>
> *First case: assume our maximum bandwidth can reach 15GBps and the dirty page
> rate is 10GBps.
>
> If x = 73 ms, at the critical convergence state,
> formula (2) gives t = D * x / (B - D) = 146 ms;
> because x + t = 219ms > 100ms,
> we get y > B * x / (B - D) = 219ms.
>
> If x = 12 ms, at the critical convergence state,
> formula (2) gives t = D * x / (B - D) = 24 ms;
> because x + t = 36ms < 100ms,
> we get y > 100 * D * x / ((B - D) * (100 - x)) = 27.3ms.
>
> We can see that after optimization, under the same bandwidth and dirty rate scenario,
> the downtime limit required for dirty page convergence is significantly reduced.
>
> *Second case: assume our maximum bandwidth can reach 15GBps and the downtime
> limit is set to 150ms.
>
> If x = 73 ms:
> when x + t > 100ms,
> rearranging formula (12) gives D < B * (y - x) / y = 15 * (150 - 73) / 150 = 7.7GBps;
> when x + t < 100ms,
> formula (19) gives D < 5.35GBps.
>
> If x = 12 ms:
> when x + t > 100ms,
> rearranging formula (12) gives D < B * (y - x) / y = 15 * (150 - 12) / 150 = 13.8GBps;
> when x + t < 100ms,
> formula (19) gives D < 13.75GBps.
>
> We can see that after optimization, under the same bandwidth and downtime
> limit, the dirty page rate at which migration still converges is significantly
> higher.
>
> The above derivation shows that reducing bitmap sync time can significantly
> improve dirty page convergence.
>
> This patch only optimizes bitmap sync time for some scenarios. There may still
> be many scenarios where bitmap sync time hurts dirty page convergence, and
> those can be optimized with the same approach; the sketch below illustrates
> the core idea.
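>
> For reference, the core idea of the patch (merging fragmented clear-dirty
> ioctls) can be illustrated with a generic range-coalescing helper. This is a
> simplified sketch with made-up types, not the actual patch code:
>
>     #include <stddef.h>
>     #include <stdint.h>
>
>     struct dirty_range {
>         uint64_t start;    /* first page of the run */
>         uint64_t npages;   /* length of the run in pages */
>     };
>
>     /*
>      * Coalesce sorted, strictly adjacent dirty ranges in place so that
>      * one clear-dirty call can cover a whole span instead of one call
>      * per fragment. Returns the number of ranges after merging.
>      */
>     static size_t coalesce_ranges(struct dirty_range *r, size_t n)
>     {
>         size_t out = 0;
>
>         for (size_t i = 1; i < n; i++) {
>             if (r[i].start == r[out].start + r[out].npages) {
>                 r[out].npages += r[i].npages;  /* extend current span */
>             } else {
>                 r[++out] = r[i];               /* gap: new span */
>             }
>         }
>         return n ? out + 1 : 0;
>     }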
>
--
Peter Xu