From: "Dr. David Alan Gilbert" <dave@treblig.org>
To: Peter Xu <peterx@redhat.com>
Cc: "Lukas Straub" <lukasstraub2@web.de>,
qemu-devel@nongnu.org, "Juraj Marcin" <jmarcin@redhat.com>,
"Fabiano Rosas" <farosas@suse.de>,
"Markus Armbruster" <armbru@redhat.com>,
"Daniel P . Berrangé" <berrange@redhat.com>,
"Lukáš Doktor" <ldoktor@redhat.com>,
"Juan Quintela" <quintela@trasno.org>,
"Zhang Chen" <zhangckid@gmail.com>,
zhanghailiang@xfusion.com, "Li Zhijian" <lizhijian@fujitsu.com>,
"Jason Wang" <jasowang@redhat.com>
Subject: Re: [PATCH 1/3] migration/colo: Deprecate COLO migration framework
Date: Wed, 21 Jan 2026 01:25:32 +0000
Message-ID: <aXArDHMRAohmUt51@gallifrey>
In-Reply-To: <aW_ccMSY4xJlRVn2@x1.local>
* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Jan 20, 2026 at 07:04:09PM +0000, Dr. David Alan Gilbert wrote:
<snip>
> > > (2) Failure happens _after_ applying the new checkpoint, but _before_ the
> > > whole checkpoint is applied.
> > >
> > > To be explicit, consider qemu_load_device_state() when the process of
> > > colo_incoming_process_checkpoint() failed. It means SVM applied
> > > partial of PVM's checkpoint, I think it should mean PVM is completely
> > > corrupted.
> >
> > As long as the SVM has got the entire checkpoint, then it *can* apply it all
> > and carry on from that point.
>
> Does it mean we assert() that qemu_load_device_state() will always succeed
> for COLO syncs?
Not sure; I'd expect if that load fails then the SVM fails; if that happens
on a periodic checkpoint then the PVM should carry on.
> Logically post_load() can invoke anything and I'm not sure if something can
> start to fail, but I confess I don't know an existing device that can
> trigger it.
Like a postcopy, it shouldn't fail unless there's an underlying failure
(e.g. storage died)
> Lukas told me something was broken though with pc machine type, on
> post_load() not re-entrant. I think it might be possible though when
> post_load() is relevant to some device states (that guest driver can change
> between two checkpoint loads), but that's still only theoretical. So maybe
> we can indeed assert it here.
I don't understand that non re-entrant bit?
> >
> > > Here either (1.b) or (2) seems fatal to me on the whole high level design.
> > > Periodical syncs with x-checkpoint-delay can make this easier to happen, so
> > > larger windows of critical failures. That's also why I think it's
> > > confusing COLO prefers more checkpoints - while it helps sync things up, it
> > > enlarges high risk window and overall overhead.
> >
> > No, there should be no point at which a failure leaves the SVM without a checkpoint
> > that it can apply to take over.
> >
> > > > > > I have quite a few more performance and cleanup patches on my hands,
> > > > > > for example to transfer dirty memory between checkpoints.
> > > > > >
> > > > > > >
> > > > > > > IIUC, the critical path of COLO shouldn't be migration on its own? It
> > > > > > > should be when heartbeat gets lost; that normally should happen when two
> > > > > > > VMs are in sync. In this path, I don't see how multifd helps.. because
> > > > > > > there's no migration happening, only the src recording what has changed.
> > > > > > > Hence I think some number with description of the measurements may help us
> > > > > > > understand how important multifd is to COLO.
> > > > > > >
> > > > > > > Supporting multifd will cause new COLO functions to inject into core
> > > > > > > migration code paths (even if not much..). I want to make sure such (new)
> > > > > > > complexity is justified. I also want to avoid introducing a feature only
> > > > > > > because "we have XXX, then let's support XXX in COLO too, maybe some day
> > > > > > > it'll be useful".
> > > > > >
> > > > > > What COLO needs from migration at the low level:
> > > > > >
> > > > > > Primary/Outgoing side:
> > > > > >
> > > > > > Not much actually, we just need a way to incrementally send the
> > > > > > dirtied memory and the full device state.
> > > > > > Also, we ensure that migration never actually finishes since we will
> > > > > > never do a switchover. For example we never set
> > > > > > RAMState::last_stage with COLO.
> > > > > >
> > > > > > Secondary/Incoming side:
> > > > > >
> > > > > > colo cache:
> > > > > > Since the secondary always needs to be ready to take over (even during
> > > > > > checkpointing), we can not write the received ram pages directly to
> > > > > > the guest ram to prevent having half of the old and half of the new
> > > > > > contents.
> > > > > > So we redirect the received ram pages to the colo cache. This is
> > > > > > basically a mirror of the primary side ram.
> > > > > > > It also simplifies the primary side since from its point of view it's
> > > > > > just a normal migration target. So primary side doesn't have to care
> > > > > > about dirtied pages on the secondary for example.
> > > > > >
> > > > > > Dirty Bitmap:
> > > > > > With COLO we also need a dirty bitmap on the incoming side to track
> > > > > > 1. pages dirtied by the secondary guest
> > > > > > 2. pages dirtied by the primary guest (incoming ram pages)
> > > > > > In the last step during the checkpointing, this bitmap is then used
> > > > > > to overwrite the guest ram with the colo cache so the secondary guest
> > > > > > is in sync with the primary guest.
> > > > > >
> > > > > > All this individually is very little code as you can see from my
> > > > > > multifd patch. Just something to keep in mind I guess.
> > > > > >
> > > > > >
> > > > > > At the high level we have the COLO framework outgoing and incoming
> > > > > > threads which just tell the migration code to:
> > > > > > Send all ram pages (qemu_savevm_live_state()) on the outgoing side
> > > > > > paired with a qemu_loadvm_state_main on the incoming side.
> > > > > > Send the device state (qemu_save_device_state()) paired with writing
> > > > > > that stream to a buffer on the incoming side.
> > > > > > > And finally flushing the colo cache and loading the device state on the
> > > > > > incoming side.
> > > > > >
> > > > > > And of course we coordinate with the colo block replication and
> > > > > > colo-compare.
> > > > >
> > > > > Thank you. Maybe you should generalize some of the explanations and put it
> > > > > into docs/devel/migration/ somewhere. I think many of them are not
> > > > > mentioned in the doc on how COLO works internally.
> > > > >
> > > > > Let me ask some more questions while I'm reading COLO today:
> > > > >
> > > > > - For each of the checkpoint (colo_do_checkpoint_transaction()), COLO will
> > > > > do the following:
> > > > >
> > > > > bql_lock()
> > > > > vm_stop_force_state(RUN_STATE_COLO) # stop vm
> > > > > bql_unlock()
> > > > >
> > > > > ...
> > > > >
> > > > > bql_lock()
> > > > > qemu_save_device_state() # into a temp buffer fb
> > > > > bql_unlock()
> > > > >
> > > > > ...
> > > > >
> > > > > qemu_savevm_state_complete_precopy() # send RAM, directly to the wire
> > > > > qemu_put_buffer(fb) # push temp buffer fb to wire
> > > > >
> > > > > ...
> > > > >
> > > > > bql_lock()
> > > > > vm_start() # start vm
> > > > > bql_unlock()
> > > > >
> > > > > A few questions that I didn't ask previously:
> > > > >
> > > > > - If VM is stopped anyway, why putting the device states into a temp
> > > > > buffer, instead of using what we already have for precopy phase, or
> > > > > just push everything directly to the wire?
> > > >
> > > > Actually we only do that to get the size of the device state and send
> > > > the size out-of-band, since we can not use qemu_load_device_state()
> > > > directly on the secondary side and look for the in-band EOF.
> > >
> > > I also don't understand why the size is needed..
> > >
> > > Currently the streaming protocol for COLO is:
> > >
> > > - ...
> > > - COLO_MESSAGE_VMSTATE_SEND
> > > - RAM data
> > > - EOF
> > > - COLO_MESSAGE_VMSTATE_SIZE
> > > - non-RAM data
> > > - EOF
> > >
> > > My question is about, why can't we do this instead?
> > >
> > > - ...
> > > - COLO_MESSAGE_VMSTATE_SEND
> > > - RAM data
>
> [1]
>
> > > - non-RAM data
> > > - EOF
> > >
> > > If the VM is stopped during the whole process anyway..
> > >
> > > Here RAM/non-RAM data all are vmstates, and logically can also be loaded in
> > > one shot of a vmstate load loop.
> >
> > You might be able to; in that case you would have to stream the
> > entire thing into a buffer on the secondary rather than applying the
> > RAM updates to the colo cache.
>
> I thought the colo cache is already such a buffering when receiving at [1]
> above? Then we need to flush the colo cache (including scan the SVM bitmap
> and only flush those pages in colo cache) like before.
>
> If something went wrong (e.g. channel broken during receiving non-ram
> device states), SVM can directly drop all colo cache as the latest
> checkpoint isn't complete.
Oh, I think I've remembered why it's necessary to split it into RAM and non-RAM;
you can't parse a non-RAM stream and know when you've got an EOF flag in the stream;
especially for stuff that's open coded (like some of virtio); so there's
no way to write a 'load until EOF' into a simple RAM buffer; you need to be
given an explicit size to know how much to expect.
You could do it for the RAM, but you'd need to write a protocol parser
to follow the stream to watch for the EOF. It's actually harder with multifd;
how would you make a temporary buffer with multiple streams like that?
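
To make the framing point concrete, here is a minimal sketch in Python rather
than QEMU's C. Nothing in it is QEMU code: the 8-byte size header, the
one-byte EOF_MARKER and all function names are assumptions made up for
illustration. It only shows why a receiver that cannot parse the payload needs
an explicit size, and why "read until something that looks like EOF" breaks
once the payload can contain the same byte.

```python
import io
import struct

# Hypothetical one-byte end marker, for illustration only.
EOF_MARKER = b"\x00"


def send_device_state(wire, blob):
    """Write an 8-byte big-endian size, then the opaque device-state blob."""
    wire.write(struct.pack(">Q", len(blob)))
    wire.write(blob)


def recv_device_state(wire):
    """Buffer exactly the announced number of bytes without parsing them."""
    header = wire.read(8)
    if len(header) != 8:
        raise EOFError("channel closed before the size header arrived")
    (size,) = struct.unpack(">Q", header)
    blob = wire.read(size)
    if len(blob) != size:
        # Incomplete checkpoint: the caller can drop it and keep the old one.
        raise EOFError("channel closed in the middle of the device state")
    return blob


def naive_recv_until_marker(wire):
    """Broken alternative: stop at the first byte that looks like the marker."""
    out = bytearray()
    while True:
        byte = wire.read(1)
        if not byte or byte == EOF_MARKER:
            return bytes(out)
        out += byte
```

With a payload such as b"virtio\x00open-coded\x00state", the size-prefixed
reader returns the whole blob, while the marker-scanning reader stops after
b"virtio" because it cannot tell payload bytes from the end marker.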
> > The thought of using userfaultfd-write had floated around at some time
> > as ways to optimise this.
>
> It's an interesting idea. Yes it looks working, but as Lukas said, it looks
> still unbounded.
>
> One idea to provide a strict bound:
>
> - admin sets a proper buffer to limit the extra pages to remember on SVM,
> should be much smaller than total guest mem, but admin should make sure
> in 99.99% cases it won't hit the limit with a proper x-checkpoint-delay,
>
> - if limit triggered, both VMs needs to pause (initiated by SVM), SVM
> needs to explicitly request a checkpoint to src,
>
> - VMs can only start again after two VMs sync again
Right, that should be doable with a userfault-write.
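
A rough sketch of that bound as a policy, again in Python and not tied to any
real QEMU or userfaultfd interface. The class name, the page-index set and the
"limit reached means request a checkpoint" return value are all hypothetical;
it only models the three rules quoted above (admin-set limit, pause and
request a checkpoint when it is hit, resume only after the sync).

```python
class ExtraPageBudget:
    """Tracks how many extra pages the SVM has to remember between checkpoints."""

    def __init__(self, limit_pages):
        self.limit_pages = limit_pages  # set by the admin
        self.tracked = set()            # page indexes remembered so far
        self.paused = False             # True while waiting for a forced sync

    def note_page(self, page):
        """Record one more page (e.g. from a hypothetical write-fault handler).

        Returns True when the limit is reached, meaning the SVM should pause
        both VMs and explicitly request a checkpoint from the source.
        """
        self.tracked.add(page)
        if len(self.tracked) >= self.limit_pages:
            self.paused = True
        return self.paused

    def checkpoint_synced(self):
        """Both VMs are in sync again: forget the pages and allow a restart."""
        self.tracked.clear()
        self.paused = False
```

The memory cost on the SVM is then capped at limit_pages pages, at the price
of a forced checkpoint whenever the guest dirties more than that within one
x-checkpoint-delay period.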
Dave
> Thanks,
>
> --
> Peter Xu
>
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/