From: Peter Xu <peterx@redhat.com>
To: Fabiano Rosas <farosas@suse.de>
Cc: Trieu Huynh <vikingtc4@gmail.com>, qemu-devel@nongnu.org
Subject: Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
Date: Thu, 23 Apr 2026 12:14:55 -0400 [thread overview]
Message-ID: <aepFf11lBgdXK6CB@x1.local> (raw)
In-Reply-To: <87o6jaeig8.fsf@suse.de>
On Wed, Apr 22, 2026 at 07:30:47PM -0300, Fabiano Rosas wrote:
> Trieu Huynh <vikingtc4@gmail.com> writes:
>
> > From: Trieu Huynh <vikingtc4@gmail.com>
> >
> > When a multifd migration is cancelled and the user changes
> > multifd-channels via QMP before cleanup completes, the shutdown and
> > termination loops re-read migrate_multifd_channels() which now returns
> > the new value.
>
> Right, so this is because migrate-set-parameters is allowed to set
> so-called (by me) runtime options, such as downtime-limit, which means
I like this new name, if we ever need a name for such..
> we cannot block it while migration_is_running() = true as we do for
> migrate-set-capabilities. The "right" fix here is something I discussed
> with Peter a while back, which is to write a whitelist of commands that
> we're certain have no negative effect during migration runtime (or are
> simply required as part of normal functioning) and block everything else
> behind a migration_is_running() check.
>
> Still, I think we can consider this patch in isolation for now... Let me
> continue looking.
>
> > This causes the loops to iterate over, for instance
> > fewer channels than were created, leaving yank functions of the
> > abandoned channels still registered when yank_unregister_instance()
> > is called, triggering an abort:
> > qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance:
> > Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
> > Aborted (core dumped)
>
> Ah yes, the assert machine doing it's job as usual.
>
> >
> > Fix by storing the channel count at setup time and using that frozen
> > value in all subsequent loops. The live parameter
> > migrate_multifd_channels() is now only read once during setup, ensuring
> > teardown always operates on the exact set of channels that were created.
> >
>
> Take a look at multifd_send(), there's some shenanigans there as well
> regarding changing the number of channels on the fly. Could we drop that
> logic with this patch?
I don't know anything allows dynamic number of channels. If it's about:
/*
* next_channel can remain from a previous migration that was
* using more channels, so ensure it doesn't overflow if the
* limit is lower now.
*/
It's about another migration after a failed/cancelled migration only, where
next_channel is currently a static variable.
>
> > Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
> > ---
> > migration/multifd.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
>
> Hmm, I see 20 instances of migrate_multifd_channels() being used in
> multifd.c. It seems you missed some.
>
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 035cb70f7b..69c8f6747b 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -75,6 +75,8 @@ struct {
> > int exiting;
> > /* multifd ops */
> > const MultiFDMethods *ops;
> > + /* number of channels created (fixed at setup) */
> > + int channel_num;
>
> Reads like "channel number" to me. As in "the number of the
> channel". I'd use n_channels, num_channels or channels_num.
>
> Naming aside... we'll then have three variables representing number of
> multifd channels:
>
> s->parameters.multifd_channels
> multifd_send_state->channel_num
> multifd_recv_state->channel_num
>
> (or just 2 and inconsistent representation between send/recv, which is
> worse IMO)
>
> Let's go back to the core issue I described at the start, could we
> instead check at migrate_params_test_apply() whether migration is
> running and return an error when trying to change multifd channels?
>
> There might be issues with current_migration going away while QMP is
> still dispatching, but I'm not sure it will be productive if we start to
> solve locally the troubles caused by each parameter when changed at
> migration runtime.
Agreed, I think we should fix it with the whitelist idea.
If you haven't started looking at this (??), I wonder if Trieu would like
to look at it as a replacement of this patch.
--
Peter Xu
next prev parent reply other threads:[~2026-04-23 16:15 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-22 16:12 [PATCH 0/1] migration/multifd: fix channel count TOCTOU race on cancel and retry Trieu Huynh
2026-04-22 16:12 ` [PATCH 1/1] " Trieu Huynh
2026-04-22 22:30 ` Fabiano Rosas
2026-04-23 16:14 ` Peter Xu [this message]
2026-04-23 18:13 ` Fabiano Rosas
2026-04-23 18:44 ` Peter Xu
2026-04-23 19:41 ` Fabiano Rosas
2026-04-24 13:53 ` Peter Xu
2026-04-24 14:15 ` Fabiano Rosas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aepFf11lBgdXK6CB@x1.local \
--to=peterx@redhat.com \
--cc=farosas@suse.de \
--cc=qemu-devel@nongnu.org \
--cc=vikingtc4@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.