From: Peter Xu <peterx@redhat.com>
To: Chuang Xu <xuchuangxclwt@bytedance.com>
Cc: qemu-devel@nongnu.org, mst@redhat.com, sgarzare@redhat.com,
richard.henderson@linaro.org, pbonzini@redhat.com,
david@kernel.org, philmd@linaro.org, farosas@suse.de
Subject: Re: [PATCH v2 0/1] migration: reduce bitmap sync time and make dirty pages converge much more easily
Date: Mon, 15 Dec 2025 11:26:13 -0500
Message-ID: <aUA2pYf68psZazPu@x1.local>
In-Reply-To: <20251215140611.16180-1-xuchuangxclwt@bytedance.com>
On Mon, Dec 15, 2025 at 10:06:10PM +0800, Chuang Xu wrote:
> In this version:
>
> - drop duplicate vhost_log_sync optimization
> - refactor physical_memory_test_and_clear_dirty
> - provide a more detailed breakdown of bitmap sync time in this cover letter
>
>
> In our long-term experience at Bytedance, we've found that under the same load,
> live migration of larger VMs with more devices often has more difficulty
> converging (i.e. it requires a larger downtime limit).
>
> We've observed that the live migration bandwidth qemu calculates for large,
> multi-device VMs is severely distorted, likely the same problem described at
> https://wiki.qemu.org/ToDo/LiveMigration#Optimize_migration_bandwidth_calculation.
>
> Through testing and calculation, we conclude that bitmap sync time distorts
> qemu's live migration bandwidth calculation.
>
> Below, we use some formulaic reasoning to derive the relationship between the
> bitmap sync time and the downtime limit required to reach the stop condition.
>
> Assume the actual live migration bandwidth is B, the dirty page rate is D,
> the bitmap sync time is x (ms), the transfer time per iteration is t (ms), and the
> downtime limit is y (ms).
>
> To simplify the calculation, we assume that none of the dirty pages are zero
> pages and only consider the case B > D.
>
> When x + t > 100ms, the bandwidth calculated by qemu is R = B * t / (x + t).
> When x + t < 100ms, the bandwidth calculated by qemu is R = B * (100 - x) / 100.
>
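> To make the model concrete, here is a minimal C sketch of the two estimation
> cases above (an illustration of the model only, not the actual migration.c
> code; WINDOW_MS stands in for the 100ms threshold used in the formulas):
>
>     /* Sampling window from the formulas above, in ms. */
>     #define WINDOW_MS 100.0
>
>     /* Estimated bandwidth R given actual bandwidth B, bitmap sync
>      * time x (ms) and per-iteration transfer time t (ms). */
>     static double estimated_bw(double B, double x, double t)
>     {
>         if (x + t > WINDOW_MS) {
>             /* Long iteration: the window is x + t, but data only
>              * flows during t, so R = B * t / (x + t). */
>             return B * t / (x + t);
>         }
>         /* Short iteration: the window is padded to 100ms, of which
>          * only the sync time x is idle, so R = B * (100 - x) / 100. */
>         return B * (WINDOW_MS - x) / WINDOW_MS;
>     }
>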
> At the critical convergence state, the data sent per iteration exactly matches
> the data dirtied during it, so we have:
> (1) B * t = D * (x + t)
> (2) t = D * x / (B - D)
> For the stop condition to be met, we have two cases.
> When:
> (3) x + t > 100
> (4) x + D * x / (B - D) > 100
> (5) x > 100 - 100 * D / B
> Then:
> (6) R * y > D * (x + t)
> (7) B * t * y / (x + t) > D * (x + t)
> (8) (B * (D * x / (B - D)) * y) / (x + D * x / (B - D)) > D * (x + D * x / (B - D))
> (9) D * y > D * (x + D * x / (B - D))
> (10) y > x + D * x / (B - D)
> (11) (B - D) * y > B * x
> (12) y > B * x / (B - D)
>
> When:
> (13) x + t < 100
> (14) x + D * x / (B - D) < 100
> (15) x < 100 - 100 * D / B
> Then:
> (16) R * y > D * (x + t)
> (17) B * (100 - x) * y / 100 > D * (x + t)
> (18) B * (100 - x) * y / 100 > D * (x + D * x / (B - D))
> (19) y > 100 * D * x / ((B - D) * (100 - x))
>
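> As a quick sanity check, both bounds can be evaluated directly (a hypothetical
> helper; B and D share a unit, x is in ms, and B > D is assumed):
>
>     /* Minimum downtime limit y (ms) needed for convergence, from
>      * formulas (2), (12) and (19). */
>     static double min_downtime_ms(double B, double D, double x)
>     {
>         double t = D * x / (B - D);               /* formula (2) */
>
>         if (x + t > 100.0) {
>             return B * x / (B - D);               /* formula (12) */
>         }
>         return 100.0 * D * x
>                / ((B - D) * (100.0 - x));         /* formula (19) */
>     }
>
> Plugging in the numbers used below (B = 15, D = 10, x = 73 or 12) reproduces
> the 219ms and 27.3ms bounds.
>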
> With the formulas derived, we can plug in some real data for comparison.
>
> For a 64C256G VM with 8 vhost-user-net devices (32 queues per NIC) and 16
> vhost-user-blk devices (4 queues per disk), the sync time is as high as *73ms*
> (tested at a 10GBps dirty rate; the sync time increases as the dirty page rate
> increases). The sync time breaks down as follows:
>
> - sync from kvm to ram_list: 2.5ms
> - vhost_log_sync: 3ms
> - sync aligned memory from ram_list to RAMBlock: 5ms
> - sync misaligned memory from ram_list to RAMBlock: 61ms
>
> After applying this patch, syncing misaligned memory from ram_list to RAMBlock takes only about 1ms,
> and the total sync time is only *12ms*.

These numbers are greatly helpful, thanks a lot. Please put them into the
commit message of the patch.

OTOH, IMHO you can drop the formula and bandwidth calculation complexities.
Your numbers here already show that this patch is very useful.

I could have amended the commit message myself when queuing, but there's a
code change I want to double check with you. I'll reply there soon.

>
> *First case: assume our maximum bandwidth can reach 15GBps and the dirty page
> rate is 10GBps.
>
> If x = 73 ms, at the critical convergence state,
> formula (2) gives t = D * x / (B - D) = 146 ms;
> because x + t = 219ms > 100ms,
> we get y > B * x / (B - D) = 219ms.
>
> If x = 12 ms, at the critical convergence state,
> formula (2) gives t = D * x / (B - D) = 24 ms;
> because x + t = 36ms < 100ms,
> we get y > 100 * D * x / ((B - D) * (100 - x)) = 27.3ms.
>
> We can see that after optimization, under the same bandwidth and dirty rate scenario,
> the downtime limit required for dirty page convergence is significantly reduced.
>
> *Second case: assume our maximum bandwidth can reach 15GBps and the downtime
> limit is set to 150ms.
>
> If x = 73 ms:
> when x + t > 100ms,
> rearranging formula (12) gives D < B * (y - x) / y = 15 * (150 - 73) / 150 = 7.7GBps;
> when x + t < 100ms,
> formula (19) gives D < 5.35GBps.
>
> If x = 12 ms:
> when x + t > 100ms,
> rearranging formula (12) gives D < B * (y - x) / y = 15 * (150 - 12) / 150 = 13.8GBps;
> when x + t < 100ms,
> formula (19) gives D < 13.75GBps.
>
> We can see that after optimization, under the same bandwidth and downtime
> limit, the dirty page rate at which migration still converges is significantly
> higher.
>
> The above derivation shows that reducing bitmap sync time can significantly
> improve dirty page convergence.
>
> This patch only optimizes bitmap sync time for some scenarios. There may still
> be many scenarios where bitmap sync time hurts dirty page convergence, and
> those can be optimized with the same approach; the sketch below illustrates
> the core idea.
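>
> For reference, the core idea of the patch (merging fragmented clear-dirty
> ioctls) can be illustrated with a generic range-coalescing helper. This is a
> simplified sketch with made-up types, not the actual patch code:
>
>     #include <stddef.h>
>     #include <stdint.h>
>
>     struct dirty_range {
>         uint64_t start;    /* first page of the run */
>         uint64_t npages;   /* length of the run in pages */
>     };
>
>     /*
>      * Coalesce sorted, strictly adjacent dirty ranges in place so that
>      * one clear-dirty call can cover a whole span instead of one call
>      * per fragment. Returns the number of ranges after merging.
>      */
>     static size_t coalesce_ranges(struct dirty_range *r, size_t n)
>     {
>         size_t out = 0;
>
>         for (size_t i = 1; i < n; i++) {
>             if (r[i].start == r[out].start + r[out].npages) {
>                 r[out].npages += r[i].npages;  /* extend current span */
>             } else {
>                 r[++out] = r[i];               /* gap: new span */
>             }
>         }
>         return n ? out + 1 : 0;
>     }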
>
--
Peter Xu