From: Fabiano Rosas <farosas@suse.de>
To: Yong Huang <yong.huang@smartx.com>, Lukas Straub <lukasstraub2@web.de>
Cc: qemu-devel@nongnu.org, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH] multifd: Make the main thread yield periodically to the main loop
Date: Fri, 08 Aug 2025 10:55:25 -0300
Message-ID: <87o6sp2a0i.fsf@suse.de>
In-Reply-To: <CAK9dgmbybw+WkC2C_qdZnwSYjGn3Q2Du4yjLOz+EmCx1po8YPg@mail.gmail.com>

Yong Huang <yong.huang@smartx.com> writes:

> On Fri, Aug 8, 2025 at 3:02 PM Lukas Straub <lukasstraub2@web.de> wrote:
>
>> On Fri, 8 Aug 2025 10:36:24 +0800
>> Yong Huang <yong.huang@smartx.com> wrote:
>>
>> > On Thu, Aug 7, 2025 at 5:36 PM Lukas Straub <lukasstraub2@web.de> wrote:
>> >
>> > > On Thu,  7 Aug 2025 10:41:17 +0800
>> > > yong.huang@smartx.com wrote:
>> > >
>> > > > From: Hyman Huang <yong.huang@smartx.com>
>> > > >
>> > > > When network issues such as missing TCP ACKs occur on the send
>> > > > side during multifd live migration, the source QEMU process
>> > > > reports "Connection timed out" and stops sending data. On the
>> > > > receive side, the IO channels may be blocked in recvmsg(), so
>> > > > the main loop gets stuck and fails to respond to QMP commands.
>> > > > ...
>> > >
>> > > Hi Hyman Huang,
>> > >
>> > > Have you tried the 'yank' command to shut down the sockets? It is
>> > > meant exactly for recovering from hangs and should solve your issue.
>> > >
>> > > https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#yank-feature
>> >
>> >
>> > Thanks for the comment and advice.
>> >
>> > Let me give more details about the migration state when the issue
>> > happens:
>> >
>> > On the source side, libvirt has already aborted the migration job:
>> >
>> > $ virsh domjobinfo fdecd242-f278-4308-8c3b-46e144e55f63
>> > Job type:         Failed
>> > Operation:        Outgoing migration
>> >
>> > QMP query-yank shows that there is no migration yank instance:
>> >
>> > $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 \
>> >     '{"execute":"query-yank"}' --pretty
>> > {
>> >   "return": [
>> >     {
>> >       "type": "chardev",
>> >       "id": "charmonitor"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "charchannel0"
>> >     },
>> >     {
>> >       "type": "chardev",
>> >       "id": "libvirt-2-virtio-format"
>> >     }
>> >   ],
>> >   "id": "libvirt-5217"
>> > }
>>
>> You are supposed to run it on the destination side. There, the migration
>> yank instance should be present if QEMU hangs in the migration code.
>>
>> Also, you need to execute it as an out-of-band command to bypass the
>> main loop. Like this:
>>
>> '{"exec-oob": "yank", "id": "yank0", "arguments": {"instances": [ {"type":
>> "migration"} ] } }'
>
> In our case, libvirt operations on the destination VM are blocked by
> the migration job:
>
> $ virsh qemu-monitor-command fdecd242-f278-4308-8c3b-46e144e55f63 \
>     '{"query-commands"}' --pretty
> error: Timed out during operation: cannot acquire state change lock (held
> by monitor=remoteDispatchDomainMigratePrepare3Params)
>
> So issuing the yank command through libvirt is not an option.
>
>
>>
>>
>> I'm not sure if libvirt can do that, maybe you need to add an
>> additional qmp socket and do it outside of libvirt. Note that you need
>> to enable the oob feature during qmp negotiation, like this:
>>
>> '{ "execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }'
>
>
> No. I checked libvirt's source code and found that libvirt disables OOB
> by default when it initializes the QEMU monitor.
>
> Therefore, perhaps we can first enable OOB and add the yank capability
> to libvirt, and then add the yank logic to the necessary path. In our
> case that is the migration code:
>
> qemuMigrationDstFinish:
>     if (retcode != 0) {
>         /* Check for a possible error on the monitor in case Finish was called
>          * earlier than monitor EOF handler got a chance to process the error
>          */
>         qemuDomainCheckMonitor(driver, vm, QEMU_ASYNC_JOB_MIGRATION_IN);
>         goto endjob;
>     }
>
>
>
>>
>> Regards,
>> Lukas Straub
>>
>> >
>> > The libvirt migration job is stuck as the following backtrace shows; it
>> > shows that migration is waiting for the "Finish" RPC on the destination
>> > side to return.
>> >
>> > ...
>> >
>> > IMHO, the key reason for the issue is that QEMU fails to run the main
>> > loop and fails to respond to QMP, which is not what we usually expect.
>> >
>> > Giving libvirt a window of time to issue a QMP command and kill the VM
>> > is the ideal solution for this issue; it provides an automatic method.
>> >
>> > I have not dug into the yank feature. Perhaps it is helpful, but only
>> > manually?
>> >
>> > After all, these two options are not mutually exclusive, I think.
>> >

Please work with Lukas to figure out whether yank can be used here. I
think that's the correct approach. If the main loop is blocked, then
some out-of-band cancellation routine is needed. migrate_cancel() could
be it, but at the moment it's not. Yank is the second best thing.
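
For reference, the out-of-band sequence Lukas describes above could be
sketched roughly as follows. This is only an illustration, not something
tested against a real guest: it assumes an additional QMP UNIX socket on the
destination QEMU that libvirt does not own, and the helper names
(capabilities_request, yank_migration_request, read_reply, yank_migration)
are made up for this sketch. Only the two JSON messages are taken from the
thread.

```python
import json
import socket


def capabilities_request():
    """Negotiation message that enables out-of-band (OOB) execution."""
    return {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}


def yank_migration_request(request_id="yank0"):
    """OOB 'yank' request for the migration instance, as quoted by Lukas."""
    return {
        "exec-oob": "yank",
        "id": request_id,
        "arguments": {"instances": [{"type": "migration"}]},
    }


def read_reply(reader, request_id=None):
    """Read JSON lines until a command reply arrives, skipping async events."""
    for line in reader:
        msg = json.loads(line)
        if "event" in msg:
            continue
        if request_id is not None and msg.get("id") != request_id:
            continue
        return msg
    raise ConnectionError("QMP socket closed before a reply arrived")


def yank_migration(sock):
    """Negotiate OOB on a connected QMP socket, then yank the migration."""
    # Separate reader and writer objects so pending input is never dropped.
    reader = sock.makefile("r", encoding="utf-8", newline="\n")
    writer = sock.makefile("w", encoding="utf-8", newline="\n")

    # The monitor speaks first: its greeting lists the offered capabilities.
    greeting = json.loads(reader.readline())
    if "oob" not in greeting["QMP"].get("capabilities", []):
        raise RuntimeError("this QMP monitor does not offer the oob capability")

    reply = None
    for request in (capabilities_request(), yank_migration_request()):
        writer.write(json.dumps(request) + "\n")
        writer.flush()
        reply = read_reply(reader, request.get("id"))
        if "error" in reply:
            raise RuntimeError(reply["error"])
    return reply
```

To use it, connect a socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) to
the extra monitor socket (the path depends on how QEMU was started) and pass
it to yank_migration(). Because the request uses "exec-oob", it should be
handled even while the main loop is stuck, which is the whole point of the
exercise.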

The need for a timeout is usually indicative of a design issue. In this
case, the choice of a coroutine for the incoming side is the obvious
one. Peter will tell you all about it! =)


