From: Fabiano Rosas <farosas@suse.de>
To: Yong Huang <yong.huang@smartx.com>, Lukas Straub <lukasstraub2@web.de>
Cc: qemu-devel@nongnu.org, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH] multifd: Make the main thread yield periodically to the main loop
Date: Fri, 08 Aug 2025 10:55:25 -0300
Message-ID: <87o6sp2a0i.fsf@suse.de>
In-Reply-To: <CAK9dgmbybw+WkC2C_qdZnwSYjGn3Q2Du4yjLOz+EmCx1po8YPg@mail.gmail.com>

Yong Huang <yong.huang@smartx.com> writes:

> On Fri, Aug 8, 2025 at 3:02 PM Lukas Straub <lukasstraub2@web.de> wrote:
>
>> On Fri, 8 Aug 2025 10:36:24 +0800
>> Yong Huang <yong.huang@smartx.com> wrote:
>>
>> > On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstraub2@web.de> wrote:
>> >
>> > > On Thu, 7 Aug 2025 10:41:17 +0800
>> > > yong.huang@smartx.com wrote:
>> > >
>> > > > From: Hyman Huang <yong.huang@smartx.com>
>> > > >
>> > > > When network issues such as missing TCP ACKs occur on the send
>> > > > side during multifd live migration, the source QEMU process
>> > > > reports the error "Connection timed out" and stops sending data.
>> > > > On the receive side, the IO channels may be blocked in recvmsg(),
>> > > > so the main loop gets stuck and consequently fails to respond
>> > > > to QMP commands.
>> > > > ...
>> > >
>> > > Hi Hyman Huang,
>> > >
>> > > Have you tried the 'yank' command to shut down the sockets? It is
>> > > meant exactly to recover from hangs and should solve your issue.
>> > >
>> > > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
>> >
>> >
>> > Thanks for the comment and advice.
>> >
>> > Let me give more details about the migration state when the issue
>> > happens:
>> >
>> > On the source side, libvirt has already aborted the migration job:
>> >
>> > $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
>> > Job type: Failed
>> > Operation: Outgoing migration
>> >
>> > QMP query-yank shows that there is no migration yank instance:
>> >
>> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 \
>> >     '{"execute":"query-yank"}' --pretty
>> > {
>> >   "return": [
>> >     {
>> >       "type": "chardev",
>> >       "id": "charmonitor"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "charchannel0"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "libvirt-2-virtio-format"
>> >     }
>> >   ],
>> >   "id": "libvirt-5217"
>> > }
>>
>> You are supposed to run it on the destination side; there, the migration
>> yank instance should be present if QEMU hangs in the migration code.
>>
>> Also, you need to execute it as an out-of-band command to bypass the
>> main loop. Like this:
>>
>> '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [ {"type": "migration"} ] } }'
>
> In our case, libvirt's operations on the VM on the destination side are
> blocked by the migration job:
>
> $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 \
>     '{"query-commands"}' --pretty
> error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
>
> So issuing the yank command through libvirt is not an option.
>
>
>>
>>
>> I'm not sure if libvirt can do that, maybe you need to add an
>> additional qmp socket and do it outside of libvirt. Note that you need
>> to enable the oob feature during qmp negotiation, like this:
>>
>> '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'
>
>
> No, I checked Libvirt's source code and figured out that when the QEMU
> monitor is initialized, Libvirt by default disables the OOB.
>
> Therefore, perhaps we can first enable OOB and add the yank capability
> to libvirt, and then add the yank logic to the necessary path; in our
> case, the migration code:
>
> qemuMigrationDstFinish:
>     if (retcode != 0) {
>         /* Check for a possible error on the monitor in case Finish was called
>          * earlier than monitor EOF handler got a chance to process the error
>          */
>         qemuDomainCheckMonitor(driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN);
>         goto endjob;
>     }
>
>
>
>>
>> Regards,
>> Lukas Straub
>>
>> >
>> > The libvirt migration job is stuck as the following backtrace shows; it
>> > shows that migration is waiting for the "Finish" RPC on the destination
>> > side to return.
>> >
>> > ...
>> >
>> > IMHO, the key reason for the issue is that QEMU fails to run the main
>> > loop and fails to respond to QMP, which is not what we usually expect.
>> >
>> > Giving libvirt a window of time to issue a QMP command and kill the VM
>> > is the ideal solution for this issue; it provides an automatic method.
>> >
>> > I have not dug into the yank feature; perhaps it is helpful, but only
>> > manually?
>> >
>> > After all, these two options are not mutually exclusive, I think.
>> >

Please work with Lukas to figure out whether yank can be used here. I
think that's the correct approach. If the main loop is blocked, then
some out-of-band cancellation routine is needed. migrate_cancel() could
be it, but at the moment it's not. Yank is the second best thing.

The need for a timeout is usually indicative of a design issue. In this
case, the choice of a coroutine for the incoming side is the obvious
one. Peter will tell you all about it! =)
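
For reference, here is a rough sketch of the sequence Lukas describes
above: connect to an additional QMP socket on the destination, enable
"oob" during capability negotiation, then send yank as an out-of-band
command. It is only an illustration under some assumptions: the socket
path is hypothetical (it has to be a separate monitor that libvirt does
not own), the function names are made up for this example, and the "id"
on qmp_capabilities is added here only to match replies. The two QMP
messages themselves are the ones quoted earlier in the thread.

```python
import json
import socket


def qmp_capabilities_oob(cmd_id="caps0"):
    """Capability negotiation message that enables out-of-band execution."""
    return {"execute": "qmp_capabilities", "id": cmd_id,
            "arguments": {"enable": ["oob"]}}


def yank_migration_oob(cmd_id="yank0"):
    """Out-of-band yank of the migration instance, as quoted in the thread."""
    return {"exec-oob": "yank", "id": cmd_id,
            "arguments": {"instances": [{"type": "migration"}]}}


def has_migration_instance(query_yank_reply):
    """True if a query-yank reply lists an instance of type "migration"."""
    return any(inst.get("type") == "migration"
               for inst in query_yank_reply.get("return", []))


def read_reply(lines, cmd_id):
    """Return the reply carrying cmd_id, skipping events and other replies."""
    for line in lines:
        msg = json.loads(line)
        if msg.get("id") == cmd_id and ("return" in msg or "error" in msg):
            return msg
    raise ConnectionError("QMP connection closed before a reply arrived")


def yank_migration(sock_path, timeout=5.0):
    """Negotiate oob on a dedicated QMP UNIX socket and yank the migration.

    sock_path is an assumption: an extra QMP monitor socket configured
    outside of libvirt.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        with sock.makefile("r", encoding="utf-8") as reader:
            # The server greets first and advertises its capabilities.
            greeting = json.loads(reader.readline())
            if "oob" not in greeting["QMP"]["capabilities"]:
                raise RuntimeError("this monitor does not offer oob")
            for msg in (qmp_capabilities_oob(), yank_migration_oob()):
                sock.sendall(json.dumps(msg).encode("utf-8") + b"\n")
                reply = read_reply(reader, msg["id"])
                if "error" in reply:
                    raise RuntimeError(reply["error"])
            return reply
```

Something like yank_migration("/path/to/extra-qmp.sock") would then be
run against the destination QEMU. has_migration_instance() can be fed a
query-yank reply first; for the source-side output quoted above, which
lists only chardev instances, it returns False.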