From: Peter Xu <peterx@redhat.com>
To: "Jiří Denemark" <jdenemar@redhat.com>
Cc: Juraj Marcin <jmarcin@redhat.com>,
qemu-devel@nongnu.org,
"Dr. David Alan Gilbert" <dave@treblig.org>,
Fabiano Rosas <farosas@suse.de>
Subject: Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
Date: Tue, 30 Sep 2025 16:04:54 -0400
Message-ID: <aNw35iWaNDnYXOz7@x1.local>
In-Reply-To: <aNuMe0GD0mzFbD-K@orkuz.int.mamuti.net>
On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > From: Juraj Marcin <jmarcin@redhat.com>
> > > >
> > > > Currently, when postcopy starts, the source VM starts switchover and
> > > > sends a package containing the state of all non-postcopiable devices.
> > > > When the destination loads this package, the switchover is complete and
> > > > the destination VM starts. However, if the device state load fails or
> > > > the destination side crashes, the source side is already in
> > > > POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> > > > up-to-date machine state as the destination has not yet started.
> > > >
> > > > This patch introduces a new POSTCOPY_DEVICE state which is active
> > > > while the destination machine is loading the device state, is not yet
> > > > running, and the source side can be resumed in case of a migration
> > > > failure.
> > > >
> > > > To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> > > > side uses a PONG message that is a response to a PING message processed
> > > > just before the POSTCOPY_RUN command that starts the destination VM.
> > > > Thus, this change does not require any changes on the destination side
> > > > and is effective even with older destination versions.
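
To make the proposed transition concrete, here is a minimal, self-contained
toy model of the source-side state machine as described above (names are
illustrative only, not the actual QEMU code):

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum {
        POSTCOPY_DEVICE,   /* dest still loading device state; source holds
                            * the full machine state and can be resumed */
        POSTCOPY_ACTIVE,   /* dest running; state split across both sides */
    } MigPhase;

    /* The destination PONGs each PING only once it reaches it in the
     * migration stream, so a PONG for the PING placed just before the
     * POSTCOPY_RUN command implies the device-state package was loaded. */
    static MigPhase handle_pong(MigPhase cur, bool pong_for_run_ping)
    {
        if (cur == POSTCOPY_DEVICE && pong_for_run_ping) {
            return POSTCOPY_ACTIVE;
        }
        return cur;
    }

    int main(void)
    {
        MigPhase p = POSTCOPY_DEVICE;
        p = handle_pong(p, true);
        printf("%s\n", p == POSTCOPY_ACTIVE ? "postcopy-active"
                                            : "postcopy-device");
        return 0;
    }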
> > >
> > > Thanks, this will help libvirt: we assume the migration can be safely
> > > aborted as long as we have not successfully called "cont", and thus we
> > > just kill QEMU on the destination. But since QEMU on the source has
> > > already entered postcopy-active, we can't cancel the migration, and the
> > > result is a paused VM with no way of recovering it.
> > >
> > > This series will make the situation better as the source will stay in
> > > postcopy-device until the destination successfully loads device data.
> > > There's still room for some enhancement though. Depending on how fast
> > > this loading is, libvirt may issue cont before device data is loaded
> > > (the destination is already in postcopy-active at this point); cont
> > > always succeeds as it only marks the domain to be autostarted, but the
> > > actual start may fail later. When discussing this with Juraj we agreed
> > > on introducing the new postcopy-device state on the destination as well to
> >
> > I used to define postcopy-active as the state that can never be cancelled
> > anymore, implying that the real postcopy process is in progress, and also
> > that we must start to assume the latest VM pages are spread across both
> > sides, not just one. Cancelling or killing either side then means
> > crashing the VM.
>
> Right, although it's unfortunately not the case now as the source is in
> postcopy-active even though the complete state is still on the source.
>
> > From that angle, it could indeed be a good thing to have postcopy-device
> > on dest too, because it can mark out the small initial window when dest
> > QEMU hasn't yet started to generate new data but is only applying old
> > data (device data, of which src also owns a copy). From that POV, that
> > window indeed does not belong to postcopy-active as defined above.
> >
> > IOW, with such a definition, setting postcopy-active on dest QEMU right
> > at the entry of the ram load thread (what we do right now..) is too early.
> >
> > > make sure libvirt will only call cont once device data has been
> > > successfully loaded, so that we always get a proper result when running cont. But it
> >
> > Do we know an estimate of how much extra downtime this would introduce?
> >
> > We discussed this in a previous thread, I believe: the issue is that if
> > we cont only after device state is loaded, then dest QEMU may need to
> > wait a while until it receives the cont from libvirt, and that wait
> > contributes to the downtime. It would best be avoided; if it's
> > negligible then it's fine too, but I'm not sure it's guaranteed to be
> > negligible..
>
> We start QEMU with -S so it always needs to wait for cont from libvirt.
> We wait for postcopy-active on the destination before sending cont. So
> currently it can arrive while QEMU is still loading device state or when
> this is already done. I was just suggesting to always wait for the
> device state to be loaded before sending cont. So in some cases it would
> arrive a bit later while in other cases nothing would change. It's just
> a matter of waking up a thread waiting for postcopy-active and sending
> the command back to QEMU. There's no communication with the other host
> at this point, so I'd expect the difference to be negligible. And as I
> said, depending on how fast device state loading is compared to handing
> migration control from libvirt on the source over to the destination, we
> may already be sending cont by the time QEMU is done.
Ah OK, I think this is not a major concern, until it is shown to be.
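
For reference, the ordering being proposed above, as a self-contained
sketch; wait_for_dest_status() and qmp_cont() are hypothetical stand-ins
for libvirt's event loop and QMP client, not real libvirt APIs:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in: block until the destination QEMU reports the
     * given migration status (driven by libvirt's event loop in reality). */
    static bool wait_for_dest_status(const char *status)
    {
        printf("waiting for destination status: %s\n", status);
        return true;  /* pretend the status change arrived */
    }

    /* Hypothetical stand-in for issuing "cont" over QMP.  This is a local
     * round trip only; no cross-host communication happens here. */
    static bool qmp_cont(void)
    {
        printf("-> cont\n");
        return true;
    }

    /* The proposal: only send cont once device state has been loaded, so
     * that a failed cont still leaves the source in a cancellable state. */
    int main(void)
    {
        if (!wait_for_dest_status("postcopy-active")) {
            return 1;  /* device load failed: cancel, resume on source */
        }
        return qmp_cont() ? 0 : 1;
    }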
>
> But anyway, this would only be helpful if there's a way to actually
> cancel migration on the source when cont fails.
>
> > If the goal is to make sure libvirt knows what is happening, can it
> > still rely on the event emitted, in this case RESUME? We can also
> > reorganize how the postcopy-device and postcopy-active states are
> > reported on dest; then they'll help if RESUME is too coarse grained.
>
> The main goal is to make sure we don't end up with vCPUs paused on both
> sides during a postcopy migration that can be neither recovered nor
> canceled, effectively crashing the VM.
Right, I assume that's what Juraj's series is trying to fix. After this
series lands, I don't see why it would happen. But indeed I'm still
expecting the block drives (including their locks) to behave all fine.
>
> > So far, dest QEMU will try to resume the VM after getting the RUN
> > command; that is what loadvm_postcopy_handle_run_bh() does. When
> > autostart=1 is set, it will (1) first try to activate all block devices,
> > and, only if that succeeded, (2) do vm_start(), at the end of which the
> > RESUME event is generated. So RESUME currently implies both that disk
> > activation succeeded and that vm_start() worked.
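
In toy form, the control flow of that BH is roughly the following (a sketch
of the ordering only, not QEMU's actual code):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for the real block activation path, which takes the image
     * locks and can fail, e.g. on storage access problems. */
    static bool activate_all_block_devices(void)
    {
        return true;
    }

    static void vm_start(void)
    {
        printf("RESUME\n");  /* RESUME is emitted at the end of vm_start() */
    }

    static void postcopy_handle_run(bool autostart)
    {
        if (!autostart) {
            return;  /* started with -S: wait for an explicit cont instead */
        }
        if (!activate_all_block_devices()) {
            return;  /* activation failed: no vm_start(), no RESUME */
        }
        vm_start();  /* so RESUME implies both steps succeeded */
    }

    int main(void)
    {
        postcopy_handle_run(true);
        return 0;
    }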
> >
> > > may still fail when disk locking fails (not sure if this is the only
> > > way cont may fail). In this case we cannot cancel the migration on the
> >
> > Is there any known issue with disk locking that would make dest fail?
> > This really sounds like we should have the admin take a look.
>
> Oh definitely, it would be some kind of storage access issue on the
> destination. But we'd like to give the admin an option to actually do
> something other than just killing the VM :-) Either by automatically
> canceling the migration or by allowing recovery once the storage issues
> are solved.
The problem is, if storage locking stopped working properly, then how can
we guarantee the shared storage itself is working properly?

When I replied previously, I was expecting the admin to take a look and fix
the storage; I didn't expect the VM could still be recovered if there's no
confidence that the block devices will work fine. Locking errors, to me,
may imply block corruption already, or should I not see it like that?
Fundamentally, "crashing the VM" doesn't lose anything from the block POV
because it's always persistent when synced. It's almost only about RAM
that is getting lost, alongside it's about task status, service
availability, and the part of storage that was not flushed to backends.
Do we really want to add anything more complex when shared storage has
locking issues? Maybe there's known issues on locking that we're 100% sure
the storage is fine, but only the locking went wrong?
IIUC, the hope is that after this series lands we will have closed the gap
for almost all remaining paths that may cause both sides to HALT during a
postcopy, except for storage issues with locking. But I'm not sure whether
I missed something.
Thanks,
--
Peter Xu