Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls

From: Peter Xu <peterx@redhat.com>
To: Chuang Xu <xuchuangxclwt@bytedance.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
	mst@redhat.com, sgarzare@redhat.com,
	richard.henderson@linaro.org, pbonzini@redhat.com,
	david@kernel.org, philmd@linaro.org
Subject: Re: [PATCH v3 1/1] migration: merge fragmented clear_dirty ioctls
Date: Thu, 18 Dec 2025 10:32:48 -0500
Message-ID: <aUQeoNveybyICXjD@x1.local>
In-Reply-To: <82ca276d-831d-4e19-96e2-d88a7f94a430@bytedance.com>

On Thu, Dec 18, 2025 at 05:20:19PM +0800, Chuang Xu wrote:
> On 17/12/2025 22:59, Peter Xu wrote:
> > Right, it will, because any time used for sync has the vCPUs running, so
> > that will contribute to the total dirtied pages, hence partly increasing D,
> > as you pointed out.
> >
> > But my point is, if you _really_ have R=B, then e.g. on a 10Gbps NIC you
> > should see R~=10Gbps.  If R is not wire speed, it means R is not really
> > measured correctly.
> 
> In my experience, the bandwidth of live migration usually doesn't reach
> the NIC's bandwidth limit (my test environment's NIC bandwidth limit is 200Gbps).
> This could be due to various reasons: for example, the live migration main thread's
> ability to search for dirty pages may have reached a bottleneck;
> the NIC's interrupt binding range might limit the softirq's processing capacity;
> there might be too few multifd threads; or there might be overhead in synchronizing
> between the live migration main thread and the multifd threads.

Exactly, especially when you have 200Gbps NICs.

I wish I had some of those for testing too!  I don't, so I can't provide
really useful input.  My vague memory (I once had a chance to use a 100Gbps
NIC, if I recall correctly) is that the main thread already becomes the
bottleneck there, with (maybe?) 8 multifd threads.

I just never knew whether we need to scale it out yet.  Normally a
100G/200G setup only happens with direct-attached links, which is not a
major use case for cluster setups?  Or maybe I am outdated?

If that'll be a major use case at some point, and if the main thread is the
bottleneck distributing things, then we need to scale it out.  I think it's
doable.

> 
> >
> > I think it's likely impossible to measure the correct R so that it'll be
> > equal to B, however IMHO we can still think about something that gets R
> > much closer to B.  Then, when y is normally a constant (default 300ms, for
> > example), it'll start to converge where it used to not be able to.
> 
> Yes, there are always various factors that can cause measurement errors.
> We can only try to make the calculated value as close as possible to the actual value.
> 
> > E.g. QEMU can currently report R as low as 10Mbps even on a 10Gbps NIC.
> > IMHO it'll be much better, and start solving a lot of such problems, if it
> > can start to report at least a few Gbps based on all kinds of methods
> > (e.g. excluding sync, as you experimented); even if it's not reporting
> > 10Gbps it'll help.
> >
> After I applied these optimizations, the bandwidth statistics from QEMU
> and the real-time NIC bandwidth monitored by atop are typically close.
> 
> Those extremely low bandwidth values (which are consistent with atop
> monitoring) are usually caused by zero pages or dirty pages with extremely
> high compression rates.  In these cases, QEMU uses very little NIC bandwidth
> to transmit a large number of dirty pages, but the bandwidth is only
> calculated based on the actual amount of data transmitted.

Yes.  That's a major issue in QEMU: zero pages / compressed pages / ... not
only affect how QEMU "measures" the mbps, but also affect how QEMU decides
when to converge.  Here I'm not talking about the bw difference causing
"bw * downtime_limit" [A] to be too small.  I'm talking about the other
side of the equation, where we compare [A] with "remain_dirty_pages *
psize" [B].  In reality, [B] isn't accurate either when zero pages /
compressed pages / ... are used.

Maybe.. the switchover decision shouldn't use MBps as the unit, but "number
of pages".  That would remove most of those effects at least, but it needs
some more consideration..
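
To make the two sides of that comparison concrete, here is a rough sketch.
It is not QEMU's actual code: the function names, the 4096-byte page size
and all the numbers are made up for illustration.  It compares [A] with [B]
in bytes, and then does the same check in units of pages:

```python
PAGE_SIZE = 4096  # bytes; assumed page size, for illustration only


def converges_by_bytes(bw_bytes_per_s, downtime_limit_ms,
                       remain_dirty_pages, psize=PAGE_SIZE):
    """Byte-based check: is [B] small enough to fit into [A]?"""
    a = bw_bytes_per_s * downtime_limit_ms / 1000  # [A] = bw * downtime_limit
    b = remain_dirty_pages * psize                 # [B] = remain_dirty_pages * psize
    return b <= a


def converges_by_pages(pages_per_s, downtime_limit_ms, remain_dirty_pages):
    """Page-based check: can the remaining pages be sent within the limit?"""
    return remain_dirty_pages <= pages_per_s * downtime_limit_ms / 1000


# Hypothetical workload dominated by zero pages: many pages move per second,
# but very few bytes hit the wire, so the measured bandwidth is tiny.
measured_bw = 1_250_000   # bytes/s, i.e. about 10Mbps
measured_pages = 100_000  # pages/s actually processed
remaining = 1_000         # dirty pages left
downtime_limit_ms = 300

by_bytes = converges_by_bytes(measured_bw, downtime_limit_ms, remaining)
by_pages = converges_by_pages(measured_pages, downtime_limit_ms, remaining)
print("bytes-based:", by_bytes, "pages-based:", by_pages)
```

With these made-up numbers the byte-based check refuses to switch over while
the page-based one allows it, which is the kind of mismatch I mean.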

> 
> If we want to use the actual number of dirty pages transmitted to calculate
> bandwidth, we face another risk: if the dirty pages transmitted before the
> downtime have a high compression ratio, and the dirty pages to be transmitted
> after the downtime have a low compression ratio, then the downtime will far
> exceed expectations.

... and what you mentioned here will also be an issue if we switch to using
n_pages to do the math. :)
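
A small worked example of that risk, again only a sketch with hypothetical
names and invented numbers: the page rate is measured while the pages are
highly compressible, but the pages left for the switchover are sent at full
size over an assumed 1Gbps link.

```python
def predicted_downtime_s(remaining_pages, measured_pages_per_s):
    """Downtime predicted from the page rate observed before switchover."""
    return remaining_pages / measured_pages_per_s


def actual_downtime_s(remaining_pages, wire_bytes_per_page, link_bytes_per_s):
    """Downtime when each remaining page really costs wire_bytes_per_page."""
    return remaining_pages * wire_bytes_per_page / link_bytes_per_s


remaining = 30_000            # dirty pages left at switchover
pages_per_s = 100_000         # rate seen while pages compressed very well
link_bytes_per_s = 125_000_000  # assumed 1Gbps link

predicted = predicted_downtime_s(remaining, pages_per_s)
actual = actual_downtime_s(remaining, 4096, link_bytes_per_s)
print(f"predicted {predicted:.3f}s, actual {actual:.3f}s")
```

Here the prediction matches a 300ms limit, but the incompressible pages take
roughly three times longer to send.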

> 
> This may have strayed a bit, but just providing some potentially useful information
> from my perspective.

Not really; the patch alone is good, and I appreciate the discussion.

Thanks,

-- 
Peter Xu
