All of lore.kernel.org
 help / color / mirror / Atom feed
From: Fabiano Rosas <farosas@suse.de>
To: Trieu Huynh <vikingtc4@gmail.com>, qemu-devel@nongnu.org
Cc: Trieu Huynh <vikingtc4@gmail.com>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
Date: Wed, 22 Apr 2026 19:30:47 -0300	[thread overview]
Message-ID: <87o6jaeig8.fsf@suse.de> (raw)
In-Reply-To: <20260422161202.34150-2-viking4@gmail.com>

Trieu Huynh <vikingtc4@gmail.com> writes:

> From: Trieu Huynh <vikingtc4@gmail.com>
>
> When a multifd migration is cancelled and the user changes
> multifd-channels via QMP before cleanup completes, the shutdown and
> termination loops re-read migrate_multifd_channels() which now returns
> the new value.

Right, so this is because migrate-set-parameters is allowed to set
so-called (by me) runtime options, such as downtime-limit, which means
we cannot block it while migration_is_running() = true as we do for
migrate-set-capabilities. The "right" fix here is something I discussed
with Peter a while back, which is to write a whitelist of commands that
we're certain have no negative effect during migration runtime (or are
simply required as part of normal functioning) and block everything else
behind a migration_is_running() check.

Still, I think we can consider this patch in isolation for now... Let me
continue looking.

> This causes the loops to iterate over, for instance
> fewer channels than were created, leaving yank functions of the
> abandoned channels still registered when yank_unregister_instance()
> is called, triggering an abort:
>   qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance:
>   Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
>   Aborted (core dumped)

Ah yes, the assert machine doing it's job as usual.

>
> Fix by storing the channel count at setup time and using that frozen
> value in all subsequent loops. The live parameter
> migrate_multifd_channels() is now only read once during setup, ensuring
> teardown always operates on the exact set of channels that were created.
>

Take a look at multifd_send(), there's some shenanigans there as well
regarding changing the number of channels on the fly. Could we drop that
logic with this patch?

> Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
> ---
>  migration/multifd.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)

Hmm, I see 20 instances of migrate_multifd_channels() being used in
multifd.c. It seems you missed some.

>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 035cb70f7b..69c8f6747b 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -75,6 +75,8 @@ struct {
>      int exiting;
>      /* multifd ops */
>      const MultiFDMethods *ops;
> +    /* number of channels created (fixed at setup) */
> +    int channel_num;

Reads like "channel number" to me. As in "the number of the
channel". I'd use n_channels, num_channels or channels_num.

Naming aside... we'll then have three variables representing number of
multifd channels:

s->parameters.multifd_channels
multifd_send_state->channel_num
multifd_recv_state->channel_num

(or just 2 and inconsistent representation between send/recv, which is
worse IMO)

Let's go back to the core issue I described at the start, could we
instead check at migrate_params_test_apply() whether migration is
running and return an error when trying to change multifd channels?

There might be issues with current_migration going away while QMP is
still dispatching, but I'm not sure it will be productive if we start to
solve locally the troubles caused by each parameter when changed at
migration runtime.

>  } *multifd_send_state;
>  
>  struct {
> @@ -483,7 +485,7 @@ static void multifd_send_terminate_threads(void)
>       * Firstly, kick all threads out; no matter whether they are just idle,
>       * or blocked in an IO system call.
>       */
> -    for (i = 0; i < migrate_multifd_channels(); i++) {
> +    for (i = 0; i < multifd_send_state->channel_num; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
>          qemu_sem_post(&p->sem);
> @@ -495,7 +497,7 @@ static void multifd_send_terminate_threads(void)
>      /*
>       * Finally recycle all the threads.
>       */
> -    for (i = 0; i < migrate_multifd_channels(); i++) {
> +    for (i = 0; i < multifd_send_state->channel_num; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
>          if (p->tls_thread_created) {
> @@ -577,7 +579,7 @@ void multifd_send_shutdown(void)
>  
>      multifd_send_terminate_threads();
>  
> -    for (i = 0; i < migrate_multifd_channels(); i++) {
> +    for (i = 0; i < multifd_send_state->channel_num; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>          Error *local_err = NULL;
>  
> @@ -615,7 +617,7 @@ int multifd_send_sync_main(MultiFDSyncReq req)
>  
>      flush_zero_copy = migrate_zero_copy_send();
>  
> -    for (i = 0; i < migrate_multifd_channels(); i++) {
> +    for (i = 0; i < multifd_send_state->channel_num; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
>          if (multifd_send_should_exit()) {
> @@ -632,7 +634,7 @@ int multifd_send_sync_main(MultiFDSyncReq req)
>          qatomic_set(&p->pending_sync, req);
>          qemu_sem_post(&p->sem);
>      }
> -    for (i = 0; i < migrate_multifd_channels(); i++) {
> +    for (i = 0; i < multifd_send_state->channel_num; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
>          if (multifd_send_should_exit()) {
> @@ -926,6 +928,7 @@ bool multifd_send_setup(void)
>      thread_count = migrate_multifd_channels();
>      multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
>      multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> +    multifd_send_state->channel_num = thread_count;
>      qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
>      qemu_sem_init(&multifd_send_state->channels_created, 0);
>      qemu_sem_init(&multifd_send_state->channels_ready, 0);


  reply	other threads:[~2026-04-22 22:31 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-22 16:12 [PATCH 0/1] migration/multifd: fix channel count TOCTOU race on cancel and retry Trieu Huynh
2026-04-22 16:12 ` [PATCH 1/1] " Trieu Huynh
2026-04-22 22:30   ` Fabiano Rosas [this message]
2026-04-23 16:14     ` Peter Xu
2026-04-23 18:13       ` Fabiano Rosas
2026-04-23 18:44         ` Peter Xu
2026-04-23 19:41           ` Fabiano Rosas
2026-04-24 13:53             ` Peter Xu
2026-04-24 14:15               ` Fabiano Rosas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87o6jaeig8.fsf@suse.de \
    --to=farosas@suse.de \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=vikingtc4@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.