From: Fabiano Rosas <farosas@suse.de>
To: Prasad Pandit <ppandit@redhat.com>
Cc: qemu-devel@nongnu.org, Peter Xu <peterx@redhat.com>,
Prasad Pandit <pjp@fedoraproject.org>
Subject: Re: [PATCH] migration: introduce MIGRATION_STATUS_FAILING
Date: Tue, 23 Dec 2025 12:04:13 -0300 [thread overview]
Message-ID: <87cy45kzo2.fsf@suse.de> (raw)
In-Reply-To: <CAE8KmOweak15R4Ji6F50b_za67q=un_GDSEMGCRTYeQKod7zQA@mail.gmail.com>
Prasad Pandit <ppandit@redhat.com> writes:
> Hello Fabiano,
>
> * Thanks so much for the quick comments.
>
> On Mon, 22 Dec 2025 at 20:00, Fabiano Rosas <farosas@suse.de> wrote:
>> This doesn't look like it's caused by set-capabilities. Seems like:
>> Please clarify, there might be some other bug lurking around, such as a
>> locking issue with qemu_file_lock or even the BQL.
>>
>> Also, if possible provide an upstream backtrace, or at least mention the
>> code path based on upstream code. It's ok to keep the downstream
>> backtrace, but point that out in the commit message.
>
> * Right, migrate_fd_cleanup was renamed to migration_cleanup() in
> -> https://gitlab.com/qemu-project/qemu/-/commit/4bbadfc55e6ec608df75911b4360e6e995daa28c
Yep, 10 years from now someone will look at the code and the commit
message and be confused when they don't match. Also, for anyone
backporting or searching for bug fixes, it's good to keep things tight.
I can amend while queuing if you don't mind.
> ===
> libvirtd(8) 19:03:07.260+0000: 396587: error :
> qemuMigrationJobCheckStatus:2056 : operation failed: job 'migration
> out' failed: Unable to write to socket: Connection reset by peer
> libvirtd(8) 19:03:07.261+0000: 396587: info : qemuMonitorSend:836 :
> QEMU_MONITOR_SEND_MSG: mon=0xffffa000e010
> msg={"execute":"migrate-set-capabilities","arguments":
>
>
> qemu-system-aarch64:initiating migration
> qemu-system-aarch64: migrate_fd_cleanup: to_dst_file: 1: 0xaaaaccce0110
> qemu-system-aarch64: migrate_fd_cleanup: before multifd_send_shutdown:
> 0 <== multifd disabled
> qemu-system-aarch64: migrate_fd_cleanup: to_dst_file: 2: (nil)
> qemu-system-aarch64: Unable to write to socket: Connection reset by peer
> qemu-system-aarch64: ../util/yank.c:107: yank_unregister_instance:
> Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
> qemu-system-aarch64:shutting down, reason=crashed
> ===
> * As seen above, when connection is reset
> 1) libvirtd(8) sends the migrate-set-capabilities command to QEMU
> to reset the migration options to false(0). This leads to resetting of
> s->capabilities[MIGRATION_CAPABILITY_MULTIFD] to false.
> 2) When migration_cleanup (earlier migrate_fd_cleanup) is called
> after above reset
> migration_cleanup
> -> multifd_send_shutdown
> -> if (!migrate_multifd()) {
> return; <== returns because _CAPABILITY_MULTIFD
> is reset to false(0).
> }
> when _CAPABILITY_MULTIFD is reset to false,
> multifd_send_shutdown() returns without really doing its multifd
> cleanup, ie. multifd objects still stay alive, are not freed.
>
> * And that leads to the said assert(3) failure in
> migration_cleanup()
> {
> ...
> multifd_send_shutdown() <== does not cleanup
Ok, I forgot there were yank functions for the multifd channels as well.
It seems multifd.c could be made more robust to not require checking
migrate_multifd() so much. For the shutdown call for instance, checking
(!multifd_send_state) would suffice. Maybe a general point for us to
keep in mind, that relying on such a high level knob might not be the
best idea.
> ...
> yank_unregister_instance(MIGRATION_YANK_INSTANCE); <==
> assert(3) failure
> }
>
>> I'm fine with the general idea:
>>
>> i) FAILED and CANCELLED are terminal states. It makes sense to not have
>> work happen after they're set.
>>
>> ii) Using an intermediate state, assuming locking/atomic are correct is
>> a suitable fix for the issue.
>>
>> iii) Using a FAILING status seems appropriate.
>>
>> However,
>>
>> It would be great if we could stop exposing implementation details via
>> QAPI. Does the user really need to see events for CANCELLING and
>> FAILING?
>
> * Not really; libvirtd(8)/user only needs to know about the success OR
> failure and appropriate reasons.
>
>> This is a good example where having MigrationStatus makes the code more
>> complicated. This could be exiting=true, running=false, etc. It
>> shouldn't matter why this routine failed.
>
> * True, it is rather complicated. Though currently we have 17-18
> migration states defined, going by the functions
> migration_is_running(), migration_is_active(), and
> migration_has_failed(), migration process really has only 3-4 states:
> 0 -> stopped (non-active, not-started-yet)
> 1 -> running/active (sending-receive migration data)
> 2 -> paused/active (not-running, not-send-recv-migration-data)
> 3 -> stopped/failed/completed (non-active, not-running)
> Other interim states of initiating/setting connections OR
> cancelling/failing etc. could be tracked differently.
>
Yes, the amount of states migration_is_running() checks is an indication
that we're making life harder for ourselves. It would be nice if we
could have some idea how to improve this the next time a patch like this
comes along.
next prev parent reply other threads:[~2025-12-23 15:05 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-22 11:48 [PATCH] migration: introduce MIGRATION_STATUS_FAILING Prasad Pandit
2025-12-22 14:29 ` Fabiano Rosas
2025-12-23 11:17 ` Prasad Pandit
2025-12-23 15:04 ` Fabiano Rosas [this message]
2026-01-06 10:54 ` Prasad Pandit
2026-01-06 14:53 ` Peter Xu
2025-12-23 15:30 ` Peter Xu
2025-12-23 16:01 ` Fabiano Rosas
2026-01-06 11:45 ` Prasad Pandit
2026-01-06 13:47 ` Fabiano Rosas
2026-01-07 10:49 ` Prasad Pandit
2026-01-07 13:23 ` Fabiano Rosas
2026-01-12 13:05 ` Prasad Pandit
2026-01-12 19:45 ` Fabiano Rosas
2026-01-13 11:59 ` Prasad Pandit
2026-01-14 12:16 ` Fabiano Rosas
2026-01-06 15:09 ` Peter Xu
2026-01-07 11:11 ` Prasad Pandit
2026-01-07 17:03 ` Peter Xu
2025-12-23 15:17 ` Peter Xu
2025-12-23 15:36 ` Fabiano Rosas
2025-12-23 16:27 ` Peter Xu
2026-01-06 10:49 ` Prasad Pandit
2026-01-06 15:23 ` Peter Xu
2026-01-07 11:01 ` Prasad Pandit
2026-01-07 17:07 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87cy45kzo2.fsf@suse.de \
--to=farosas@suse.de \
--cc=peterx@redhat.com \
--cc=pjp@fedoraproject.org \
--cc=ppandit@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.