qemu-devel.nongnu.org archive mirror
From: Fabiano Rosas <farosas@suse.de>
To: Yong Huang <yong.huang@smartx.com>, Lukas Straub <lukasstraub2@web.de>
Cc: qemu-devel@nongnu.org, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH] multifd: Make the main thread yield periodically to the main loop
Date: Fri, 08 Aug 2025 10:55:25 -0300
Message-ID: <87o6sp2a0i.fsf@suse.de>
In-Reply-To: <CAK9dgmbybw+WkC2C_qdZnwSYjGn3Q2Du4yjLOz+EmCx1po8YPg@mail.gmail.com>

Yong Huang <yong.huang@smartx.com> writes:

> On Fri, Aug 8, 2025 at 3:02 PM Lukas Straub <lukasstraub2@web.de> wrote:
>
>> On Fri, 8 Aug 2025 10:36:24 +0800
>> Yong Huang <yong.huang@smartx.com> wrote:
>>
>> > On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstraub2@web.de> wrote:
>> >
>> > > On Thu,  7 Aug 2025 10:41:17 +0800
>> > > yong.huang@smartx.com wrote:
>> > >
>> > > > From: Hyman Huang <yong.huang@smartx.com>
>> > > >
>> > > > When network issues such as missing TCP ACKs occur on the send
>> > > > side during multifd live migration, the send side reports the
>> > > > error "Connection timed out" and the source QEMU process stops
>> > > > sending data. On the receive side, the IO channels may be blocked
>> > > > in recvmsg(), so the main loop gets stuck and fails to respond
>> > > > to QMP commands.
>> > > > ...
>> > >
>> > > Hi Hyman Huang,
>> > >
>> > > Have you tried the 'yank' command to shut down the sockets? It is
>> > > meant exactly to recover from hangs and should solve your issue.
>> > >
>> > > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
>> >
>> > Thanks for the comment and advice.
>> >
>> > Let me give more details about the migration state when the issue
>> > happens:
>> >
>> > On the source side, libvirt has already aborted the migration job:
>> >
>> > $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
>> > Job type:         Failed
>> > Operation:        Outgoing migration
>> >
>> > QMP query-yank shows that there is no migration yank instance:
>> >
>> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63
>> > '{"execute":"query-yank"}' --pretty
>> > {
>> >   "return": [
>> >     {
>> >       "type": "chardev",
>> >       "id": "charmonitor"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "charchannel0"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "libvirt-2-virtio-format"
>> >     }
>> >   ],
>> >   "id": "libvirt-5217"
>> > }
>>
>> You are supposed to run it on the destination side; the migration
>> yank instance should be present there if qemu hangs in the migration code.
>>
>> Also, you need to execute it as an out-of-band command to bypass the
>> main loop. Like this:
>>
>> '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [ {"type":
>> "migration"} ] } }'
>
> In our case, Libvirt's operations on the VM on the destination side are
> blocked by the migration job:
>
> $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63
> '{"query-commands"}' --pretty
> error: Timed out during operation: cannot acquire state change lock (held
> by monitor=remoteDispatchDomainMigratePrepare3Params)
>
> So issuing the yank command through Libvirt is not an option.
>
>
>>
>>
>> I'm not sure if libvirt can do that, maybe you need to add an
>> additional qmp socket and do it outside of libvirt. Note that you need
>> to enable the oob feature during qmp negotiation, like this:
>>
>> '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'
>
>
> No, I checked Libvirt's source code and found that Libvirt disables OOB
> by default when it initializes the QEMU monitor.
>
> Therefore, perhaps we can first enable OOB and add the yank capability
> to Libvirt, and then add the yank logic to the necessary path, in our
> case the migration code:
>
> qemuMigrationDstFinish:
>     if (retcode != 0) {
>         /* Check for a possible error on the monitor in case Finish was called
>          * earlier than monitor EOF handler got a chance to process the error
>          */
>         qemuDomainCheckMonitor(driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN);
>         goto endjob;
>     }
>
>
>
>>
>> Regards,
>> Lukas Straub
>>
>> >
>> > The libvirt migration job is stuck as the following backtrace shows; it
>> > shows that migration is waiting for the "Finish" RPC on the destination
>> > side to return.
>> >
>> > ...
>> >
>> > IMHO, the key reason for the issue is that QEMU fails to run the main
>> > loop and fails to respond to QMP, which is not what we usually expect.
>> >
>> > Giving Libvirt a window of time to issue a QMP command and kill the VM
>> > is the ideal solution for this issue; it provides an automatic method.
>> >
>> > I have not dug into the yank feature; perhaps it is helpful, but only
>> > manually?
>> >
>> > After all, these two options are not mutually exclusive, I think.
>> >

Please work with Lukas to figure out whether yank can be used here. I
think that's the correct approach. If the main loop is blocked, then
some out-of-band cancellation routine is needed. migrate_cancel() could
be it, but at the moment it's not. Yank is the second best thing.
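
For experimenting outside of libvirt, here is a rough Python sketch of
the sequence Lukas described: negotiate with "oob" enabled, then send
the yank as an exec-oob command. The two JSON messages are the ones
quoted above. Everything else is an assumption for illustration: the
helper names, the socket path (a hypothetical second QMP unix socket
that libvirt does not own), and the expectation that the monitor sends
one JSON message per line. It is not tested against a hung destination.

```python
import json
import socket


def capabilities_msg():
    """Negotiation message that enables the out-of-band (oob) capability."""
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def yank_migration_msg(cmd_id="yank0"):
    """Out-of-band yank of the migration instance, as quoted in the thread."""
    return {
        "exec-oob": "yank",
        "id": cmd_id,
        "arguments": {"instances": [{"type": "migration"}]},
    }


def read_reply(reader):
    """Return the next message that is not an asynchronous event."""
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        msg = json.loads(line)
        if "event" not in msg:
            return msg


def yank_migration(sock_path):
    """Connect to a QMP unix socket and yank the migration instance.

    sock_path is hypothetical: an extra QMP socket not managed by libvirt.
    Assumes the monitor emits one JSON message per line.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        chan = sock.makefile("rw", encoding="utf-8")
        # The server greets first and lists the capabilities it offers.
        greeting = json.loads(chan.readline())
        if "oob" not in greeting["QMP"]["capabilities"]:
            raise RuntimeError("monitor does not advertise oob")
        reply = None
        for msg in (capabilities_msg(), yank_migration_msg()):
            chan.write(json.dumps(msg) + "\n")
            chan.flush()
            reply = read_reply(chan)
            if "error" in reply:
                raise RuntimeError(reply["error"])
        return reply
```

Calling yank_migration("/path/to/extra-qmp.sock") on the destination
host would be the intended use, with the path replaced by wherever the
extra monitor socket actually lives.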

The need for a timeout is usually indicative of a design issue. In this
case, the choice of a coroutine for the incoming side is the obvious
one. Peter will tell you all about it! =)

