Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls

From: Peter Xu <peterx@redhat.com>
To: Chuang Xu <xuchuangxclwt@bytedance.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
	mst@redhat.com, sgarzare@redhat.com,
	richard.henderson@linaro.org, pbonzini@redhat.com,
	david@kernel.org, philmd@linaro.org
Subject: Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
Date: Wed, 17 Dec 2025 09:59:11 -0500
Message-ID: <aULFP1kbeT2yceiV@x1.local>
In-Reply-To: <65dc5a3d-fe3f-48d9-b7e8-c04346308fa8@bytedance.com>

On Wed, Dec 17, 2025 at 09:43:24PM +0800, Chuang Xu wrote:
> On 17/12/2025 21:21, Peter Xu wrote:
> > On Wed, Dec 17, 2025 at 02:46:58PM +0800, Chuang Xu wrote:
> >> On 17/12/2025 00:26, Peter Xu wrote:
> >>> On Tue, Dec 16, 2025 at 10:25:46AM -0300, Fabiano Rosas wrote:
> >>>> "Chuang Xu" <xuchuangxclwt@bytedance.com> writes:
> >>>>
> >>>>> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
> >>>>>
> >>>>> In our long-term experience in Bytedance, we've found that under
> >>>>> the same load, live migration of larger VMs with more devices is
> >>>>> often more difficult to converge (requiring a larger downtime limit).
> >>>>>
> >>>>> Through some testing and calculations, we conclude that bitmap sync time
> >>>>> affects the calculation of live migration bandwidth.
> >>> Side note:
> >>>
> >>> I forgot to mention when replying to the old versions, but we introduced
> >>> avail-switchover-bandwidth to partially remedy this problem when we hit it
> >>> before - which may or may not be exactly the same reason here on unaligned
> >>> syncs as we didn't further investigate (we have VFIO-PCI devices when
> >>> testing), but the whole logic should be similar that bw was calculated too
> >>> small.
> >> In bytedance, we also migrate vms with vfio devices, which also suffer from
> >> the issue of long vfio bitmap sync time for large vm.
> >>> So even if with this patch optimizing sync, bw is always not as accurate.
> >>> I wonder if we can still fix it somehow, e.g. I wonder if 100ms is too
> >>> short a period to take samples, or at least we should be able to remember
> >>> more samples so the reported bw (even if we keep sampling per 100ms) will
> >>> cover longer period.
> >>>
> >>> Feel free to share your thoughts if you have any.
> >>>
> >> FYI:
> >> Initially, when I encountered the problem of large vm migration hard to
> >> converge,
> >> I tried subtracting the bitmap sync time from the bandwidth calculation,
> >> which alleviated the problem somewhat. However, through formula calculation,
> >> I found that this did not completely solve the problem. Therefore, I
> > If you ruled out sync time, why the bw is still not accurate?  Have you
> > investigated that?
> >
> > Maybe there's something else happening besides the sync period you
> > excluded.
> 
> Referring to the formula I wrote in the cover, after subtracting sync time,
> we get the prerequisite that R=B. Substituting this condition into the
> subsequent formula derivation (B * t = D * (x + t) and R * y > D * (x + t)),
> we will eventually get y > D * x / (B - D).
> 
> This means that even if our bandwidth calculations are correct,
> the sync time can still affect our judgment of downtime conditions.

Right, it will, because the vCPUs keep running during any time spent on
sync, so that contributes to the total dirtied pages, hence partly
increasing D, as you pointed out.
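
For reference, the quoted algebra can be checked with a small sketch. The
symbol meanings are read from context here and are assumptions (the cover
letter has the authoritative definitions): B is the real link bandwidth, R
the bandwidth QEMU reports, D the dirty rate, x the bitmap sync time, t the
time spent sending one iteration, and y the downtime limit. The function
name is made up for illustration and is not QEMU code.

```python
def min_downtime_limit(bandwidth, dirty_rate, sync_time):
    """Smallest y that satisfies y > D * x / (B - D), assuming R == B.

    From B * t = D * (x + t) we get t = D * x / (B - D).
    From R * y > D * (x + t) with R == B we get y > t.
    Units are up to the caller (e.g. MB/s and seconds), as long as
    bandwidth and dirty_rate use the same one.
    """
    if dirty_rate >= bandwidth:
        # Pages are dirtied at least as fast as they can be sent,
        # so no finite downtime limit lets the migration converge.
        return float("inf")
    return dirty_rate * sync_time / (bandwidth - dirty_rate)


# Assumed example numbers: 10Gbps link (1250 MB/s), 1000 MB/s dirty rate.
slow_sync = min_downtime_limit(1250.0, 1000.0, 0.100)  # 100ms sync
fast_sync = min_downtime_limit(1250.0, 1000.0, 0.050)  # 50ms sync
print(f"100ms sync needs y > {slow_sync:.3f}s")
print(f" 50ms sync needs y > {fast_sync:.3f}s")
```

With these made-up numbers a 100ms sync needs y above roughly 0.4s, so the
default 300ms limit would not be met, while halving the sync time brings the
bound down to roughly 0.2s.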

But my point is, if you _really_ have R=B, then e.g. on a 10Gbps NIC you
should be seeing R~=10Gbps.  If R is not wire speed, it means R is not
really measured correctly.

I think it's likely impossible to measure R so precisely that it equals B.
However, IMHO we can still think about something that gets R much closer to
B; then, since y is normally a constant (default 300ms, for example),
migrations that used to fail to converge will start to converge.

E.g. QEMU can currently report R as low as 10Mbps even on a 10Gbps link.
IMHO it'll be much better, and start solving a lot of such problems, if it
can report at least a few Gbps by whatever method works (e.g. excluding
sync, as you experimented).  Even if it's not reporting 10Gbps, it'll help.
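
To illustrate why counting sync time in the sampling window drags R down,
here is a minimal sketch. It is not QEMU's actual estimator; the function
name, its parameters and the numbers are all hypothetical, and it only
models the "excluding sync" idea mentioned above.

```python
def reported_bandwidth(mbytes_sent, window_s, sync_s=0.0, exclude_sync=False):
    """Estimate bandwidth (MB/s) over one sampling window.

    mbytes_sent:  data actually put on the wire during the window
    window_s:     wall-clock length of the window, in seconds
    sync_s:       part of the window spent in bitmap sync (nothing sent)
    exclude_sync: if True, divide only by the time spent sending
    """
    elapsed = window_s - sync_s if exclude_sync else window_s
    if elapsed <= 0:
        raise ValueError("no sending time left in this window")
    return mbytes_sent / elapsed


# Assumed scenario: a 1s window where 0.9s went to sync and the remaining
# 0.1s sent at wire speed on a 10Gbps link (1250 MB/s), i.e. 125 MB sent.
naive = reported_bandwidth(125.0, 1.0, sync_s=0.9)
adjusted = reported_bandwidth(125.0, 1.0, sync_s=0.9, exclude_sync=True)
print(f"including sync: {naive:.0f} MB/s")
print(f"excluding sync: {adjusted:.0f} MB/s")
```

In this made-up scenario the naive figure is a tenth of wire speed, while
the adjusted one is close to it. Averaging over more samples, as raised
earlier in the thread, would be a separate change on top of this.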

-- 
Peter Xu

Thread overview: 10+ messages
2025-12-16  8:00 [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls Chuang Xu
2025-12-16 13:25 ` Fabiano Rosas
2025-12-16 16:26   ` Peter Xu
2025-12-17  6:46     ` Chuang Xu
2025-12-17 13:21       ` Peter Xu
2025-12-17 13:43         ` Chuang Xu
2025-12-17 14:59           ` Peter Xu [this message]
2025-12-18  9:20             ` Chuang Xu
2025-12-18 15:32               ` Peter Xu
2025-12-17 17:01 ` Peter Xu
