* [PATCH 0/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
@ 2026-04-22 16:12 Trieu Huynh
From: Trieu Huynh @ 2026-04-22 16:12 UTC (permalink / raw)
To: qemu-devel; +Cc: Trieu Huynh
From: Trieu Huynh <vikingtc4@gmail.com>
QEMU aborts on src during multifd migration when multifd-channels is
changed after migrate_cancel:
~ # qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance:
Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)
Below is the backtrace:
#7 yank_unregister_instance (instance=0x7fffffffd870) at ../util/yank.c:107
#8 migration_cleanup (s=0x555557649d10) at ../migration/migration.c:1338
#9 migration_cleanup_bh (opaque=0x555557649d10) at ../migration/migration.c:1345
#10 migration_bh_dispatch_bh (opaque=0x7ffe500051d0) at ../migration/migration.c:350
#11 aio_bh_call (bh=0x7ffe5006d670) at ../util/async.c:173
#12 aio_bh_poll (ctx=0x555557649210) at ../util/async.c:220
#13 aio_dispatch (ctx=0x555557649210) at ../util/aio-posix.c:390
#14 aio_ctx_dispatch (source=0x555557649210) at ../util/async.c:365
...
Steps to reproduce:
* 1. Start VM on src
./qemu-system-x86_64 \
-<other_options>
* 2. Start dest with -incoming defer
./qemu-system-x86_64 \
-incoming defer \
-<other_options>
* 3. Enable multifd with 4 channels on both src and dest
{ "execute": "qmp_capabilities" }
{ "execute": "migrate-set-capabilities",
"arguments": {"capabilities": [{"capability": "multifd",
"state": true}]} }
{ "execute": "migrate-set-parameters",
"arguments": {"multifd-channels": 4} }
* 4. Trigger migration
** dest:
{ "execute": "migrate-incoming", "arguments": {"uri": "tcp:0:4444"} }
** src:
{ "execute": "migrate", "arguments": {"uri": "tcp:127.0.0.1:4444"} }
* 5. Cancel migration on src while status is active
{ "execute": "migrate_cancel" }
* 6. Immediately change multifd-channels to 2 on src
{ "execute": "migrate-set-parameters",
"arguments": {"multifd-channels": 2} }
QEMU aborts as shown above
Trieu Huynh (1):
migration/multifd: fix channel count TOCTOU race on cancel and retry
migration/multifd.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
--
2.43.0
* [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
From: Trieu Huynh @ 2026-04-22 16:12 UTC (permalink / raw)
To: qemu-devel; +Cc: Trieu Huynh, Peter Xu, Fabiano Rosas
From: Trieu Huynh <vikingtc4@gmail.com>
When a multifd migration is cancelled and the user changes
multifd-channels via QMP before cleanup completes, the shutdown and
termination loops re-read migrate_multifd_channels(), which now returns
the new value. The loops then iterate over, for instance, fewer
channels than were created, leaving the yank functions of the
abandoned channels still registered when yank_unregister_instance()
is called, triggering an abort:
qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance:
Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
Aborted (core dumped)
Fix by storing the channel count at setup time and using that frozen
value in all subsequent loops. The live parameter
migrate_multifd_channels() is now read only once, during setup, ensuring
teardown always operates on the exact set of channels that were created.
Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
---
migration/multifd.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 035cb70f7b..69c8f6747b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -75,6 +75,8 @@ struct {
int exiting;
/* multifd ops */
const MultiFDMethods *ops;
+ /* number of channels created (fixed at setup) */
+ int channel_num;
} *multifd_send_state;
struct {
@@ -483,7 +485,7 @@ static void multifd_send_terminate_threads(void)
* Firstly, kick all threads out; no matter whether they are just idle,
* or blocked in an IO system call.
*/
- for (i = 0; i < migrate_multifd_channels(); i++) {
+ for (i = 0; i < multifd_send_state->channel_num; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
qemu_sem_post(&p->sem);
@@ -495,7 +497,7 @@ static void multifd_send_terminate_threads(void)
/*
* Finally recycle all the threads.
*/
- for (i = 0; i < migrate_multifd_channels(); i++) {
+ for (i = 0; i < multifd_send_state->channel_num; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
if (p->tls_thread_created) {
@@ -577,7 +579,7 @@ void multifd_send_shutdown(void)
multifd_send_terminate_threads();
- for (i = 0; i < migrate_multifd_channels(); i++) {
+ for (i = 0; i < multifd_send_state->channel_num; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
Error *local_err = NULL;
@@ -615,7 +617,7 @@ int multifd_send_sync_main(MultiFDSyncReq req)
flush_zero_copy = migrate_zero_copy_send();
- for (i = 0; i < migrate_multifd_channels(); i++) {
+ for (i = 0; i < multifd_send_state->channel_num; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
if (multifd_send_should_exit()) {
@@ -632,7 +634,7 @@ int multifd_send_sync_main(MultiFDSyncReq req)
qatomic_set(&p->pending_sync, req);
qemu_sem_post(&p->sem);
}
- for (i = 0; i < migrate_multifd_channels(); i++) {
+ for (i = 0; i < multifd_send_state->channel_num; i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
if (multifd_send_should_exit()) {
@@ -926,6 +928,7 @@ bool multifd_send_setup(void)
thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
+ multifd_send_state->channel_num = thread_count;
qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
qemu_sem_init(&multifd_send_state->channels_created, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
--
2.43.0
* Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
From: Fabiano Rosas @ 2026-04-22 22:30 UTC (permalink / raw)
To: Trieu Huynh, qemu-devel; +Cc: Trieu Huynh, Peter Xu
Trieu Huynh <vikingtc4@gmail.com> writes:
> From: Trieu Huynh <vikingtc4@gmail.com>
>
> When a multifd migration is cancelled and the user changes
> multifd-channels via QMP before cleanup completes, the shutdown and
> termination loops re-read migrate_multifd_channels() which now returns
> the new value.
Right, so this is because migrate-set-parameters is allowed to set
so-called (by me) runtime options, such as downtime-limit, which means
we cannot block it while migration_is_running() = true as we do for
migrate-set-capabilities. The "right" fix here is something I discussed
with Peter a while back, which is to write a whitelist of commands that
we're certain have no negative effect during migration runtime (or are
simply required as part of normal functioning) and block everything else
behind a migration_is_running() check.
Still, I think we can consider this patch in isolation for now... Let me
continue looking.
> This causes the loops to iterate over, for instance
> fewer channels than were created, leaving yank functions of the
> abandoned channels still registered when yank_unregister_instance()
> is called, triggering an abort:
> qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance:
> Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
> Aborted (core dumped)
Ah yes, the assert machine doing its job as usual.
>
> Fix by storing the channel count at setup time and using that frozen
> value in all subsequent loops. The live parameter
> migrate_multifd_channels() is now only read once during setup, ensuring
> teardown always operates on the exact set of channels that were created.
>
Take a look at multifd_send(), there's some shenanigans there as well
regarding changing the number of channels on the fly. Could we drop that
logic with this patch?
> Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
> ---
> migration/multifd.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
Hmm, I see 20 instances of migrate_multifd_channels() being used in
multifd.c. It seems you missed some.
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 035cb70f7b..69c8f6747b 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -75,6 +75,8 @@ struct {
> int exiting;
> /* multifd ops */
> const MultiFDMethods *ops;
> + /* number of channels created (fixed at setup) */
> + int channel_num;
Reads like "channel number" to me. As in "the number of the
channel". I'd use n_channels, num_channels or channels_num.
Naming aside... we'll then have three variables representing number of
multifd channels:
s->parameters.multifd_channels
multifd_send_state->channel_num
multifd_recv_state->channel_num
(or just 2 and inconsistent representation between send/recv, which is
worse IMO)
Let's go back to the core issue I described at the start, could we
instead check at migrate_params_test_apply() whether migration is
running and return an error when trying to change multifd channels?
There might be issues with current_migration going away while QMP is
still dispatching, but I'm not sure it will be productive if we start to
solve locally the troubles caused by each parameter when changed at
migration runtime.
> } *multifd_send_state;
>
> struct {
> @@ -483,7 +485,7 @@ static void multifd_send_terminate_threads(void)
> * Firstly, kick all threads out; no matter whether they are just idle,
> * or blocked in an IO system call.
> */
> - for (i = 0; i < migrate_multifd_channels(); i++) {
> + for (i = 0; i < multifd_send_state->channel_num; i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
>
> qemu_sem_post(&p->sem);
> @@ -495,7 +497,7 @@ static void multifd_send_terminate_threads(void)
> /*
> * Finally recycle all the threads.
> */
> - for (i = 0; i < migrate_multifd_channels(); i++) {
> + for (i = 0; i < multifd_send_state->channel_num; i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
>
> if (p->tls_thread_created) {
> @@ -577,7 +579,7 @@ void multifd_send_shutdown(void)
>
> multifd_send_terminate_threads();
>
> - for (i = 0; i < migrate_multifd_channels(); i++) {
> + for (i = 0; i < multifd_send_state->channel_num; i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
> Error *local_err = NULL;
>
> @@ -615,7 +617,7 @@ int multifd_send_sync_main(MultiFDSyncReq req)
>
> flush_zero_copy = migrate_zero_copy_send();
>
> - for (i = 0; i < migrate_multifd_channels(); i++) {
> + for (i = 0; i < multifd_send_state->channel_num; i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
>
> if (multifd_send_should_exit()) {
> @@ -632,7 +634,7 @@ int multifd_send_sync_main(MultiFDSyncReq req)
> qatomic_set(&p->pending_sync, req);
> qemu_sem_post(&p->sem);
> }
> - for (i = 0; i < migrate_multifd_channels(); i++) {
> + for (i = 0; i < multifd_send_state->channel_num; i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
>
> if (multifd_send_should_exit()) {
> @@ -926,6 +928,7 @@ bool multifd_send_setup(void)
> thread_count = migrate_multifd_channels();
> multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
> multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> + multifd_send_state->channel_num = thread_count;
> qemu_mutex_init(&multifd_send_state->multifd_send_mutex);
> qemu_sem_init(&multifd_send_state->channels_created, 0);
> qemu_sem_init(&multifd_send_state->channels_ready, 0);
* Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
From: Peter Xu @ 2026-04-23 16:14 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Trieu Huynh, qemu-devel
On Wed, Apr 22, 2026 at 07:30:47PM -0300, Fabiano Rosas wrote:
> Trieu Huynh <vikingtc4@gmail.com> writes:
>
> > From: Trieu Huynh <vikingtc4@gmail.com>
> >
> > When a multifd migration is cancelled and the user changes
> > multifd-channels via QMP before cleanup completes, the shutdown and
> > termination loops re-read migrate_multifd_channels() which now returns
> > the new value.
>
> Right, so this is because migrate-set-parameters is allowed to set
> so-called (by me) runtime options, such as downtime-limit, which means
I like this new name, if we ever need a name for such..
> we cannot block it while migration_is_running() = true as we do for
> migrate-set-capabilities. The "right" fix here is something I discussed
> with Peter a while back, which is to write a whitelist of commands that
> we're certain have no negative effect during migration runtime (or are
> simply required as part of normal functioning) and block everything else
> behind a migration_is_running() check.
>
> Still, I think we can consider this patch in isolation for now... Let me
> continue looking.
>
> > This causes the loops to iterate over, for instance
> > fewer channels than were created, leaving yank functions of the
> > abandoned channels still registered when yank_unregister_instance()
> > is called, triggering an abort:
> > qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance:
> > Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
> > Aborted (core dumped)
>
> Ah yes, the assert machine doing its job as usual.
>
> >
> > Fix by storing the channel count at setup time and using that frozen
> > value in all subsequent loops. The live parameter
> > migrate_multifd_channels() is now only read once during setup, ensuring
> > teardown always operates on the exact set of channels that were created.
> >
>
> Take a look at multifd_send(), there's some shenanigans there as well
> regarding changing the number of channels on the fly. Could we drop that
> logic with this patch?
I don't know of anything that allows a dynamic number of channels. If it's about:
/*
* next_channel can remain from a previous migration that was
* using more channels, so ensure it doesn't overflow if the
* limit is lower now.
*/
It's about another migration after a failed/cancelled migration only, where
next_channel is currently a static variable.
>
> > Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
> > ---
> > migration/multifd.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
>
> Hmm, I see 20 instances of migrate_multifd_channels() being used in
> multifd.c. It seems you missed some.
>
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 035cb70f7b..69c8f6747b 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -75,6 +75,8 @@ struct {
> > int exiting;
> > /* multifd ops */
> > const MultiFDMethods *ops;
> > + /* number of channels created (fixed at setup) */
> > + int channel_num;
>
> Reads like "channel number" to me. As in "the number of the
> channel". I'd use n_channels, num_channels or channels_num.
>
> Naming aside... we'll then have three variables representing number of
> multifd channels:
>
> s->parameters.multifd_channels
> multifd_send_state->channel_num
> multifd_recv_state->channel_num
>
> (or just 2 and inconsistent representation between send/recv, which is
> worse IMO)
>
> Let's go back to the core issue I described at the start, could we
> instead check at migrate_params_test_apply() whether migration is
> running and return an error when trying to change multifd channels?
>
> There might be issues with current_migration going away while QMP is
> still dispatching, but I'm not sure it will be productive if we start to
> solve locally the troubles caused by each parameter when changed at
> migration runtime.
Agreed, I think we should fix it with the whitelist idea.
If you haven't started looking at this (??), I wonder if Trieu would like
to look at it as a replacement of this patch.
--
Peter Xu
* Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
From: Fabiano Rosas @ 2026-04-23 18:13 UTC (permalink / raw)
To: Peter Xu; +Cc: Trieu Huynh, qemu-devel
Peter Xu <peterx@redhat.com> writes:
> On Wed, Apr 22, 2026 at 07:30:47PM -0300, Fabiano Rosas wrote:
>> Trieu Huynh <vikingtc4@gmail.com> writes:
>>
>> > From: Trieu Huynh <vikingtc4@gmail.com>
>> >
>> > When a multifd migration is cancelled and the user changes
>> > multifd-channels via QMP before cleanup completes, the shutdown and
>> > termination loops re-read migrate_multifd_channels() which now returns
>> > the new value.
>>
>> Right, so this is because migrate-set-parameters is allowed to set
>> so-called (by me) runtime options, such as downtime-limit, which means
>
> I like this new name, if we ever need a name for such..
>
>> we cannot block it while migration_is_running() = true as we do for
>> migrate-set-capabilities. The "right" fix here is something I discussed
>> with Peter a while back, which is to write a whitelist of commands that
>> we're certain have no negative effect during migration runtime (or are
>> simply required as part of normal functioning) and block everything else
>> behind a migration_is_running() check.
>>
>> Still, I think we can consider this patch in isolation for now... Let me
>> continue looking.
>>
>> > This causes the loops to iterate over, for instance
>> > fewer channels than were created, leaving yank functions of the
>> > abandoned channels still registered when yank_unregister_instance()
>> > is called, triggering an abort:
>> > qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance:
>> > Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
>> > Aborted (core dumped)
>>
>> Ah yes, the assert machine doing its job as usual.
>>
>> >
>> > Fix by storing the channel count at setup time and using that frozen
>> > value in all subsequent loops. The live parameter
>> > migrate_multifd_channels() is now only read once during setup, ensuring
>> > teardown always operates on the exact set of channels that were created.
>> >
>>
>> Take a look at multifd_send(), there's some shenanigans there as well
>> regarding changing the number of channels on the fly. Could we drop that
>> logic with this patch?
>
> I don't know anything allows dynamic number of channels. If it's about:
>
> /*
> * next_channel can remain from a previous migration that was
> * using more channels, so ensure it doesn't overflow if the
> * limit is lower now.
> */
>
> It's about another migration after a failed/cancelled migration only, where
> next_channel is currently a static variable.
>
Ah, right. Nevermind, then.
>>
>> > Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
>> > ---
>> > migration/multifd.c | 13 ++++++++-----
>> > 1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> Hmm, I see 20 instances of migrate_multifd_channels() being used in
>> multifd.c. It seems you missed some.
>>
>> >
>> > diff --git a/migration/multifd.c b/migration/multifd.c
>> > index 035cb70f7b..69c8f6747b 100644
>> > --- a/migration/multifd.c
>> > +++ b/migration/multifd.c
>> > @@ -75,6 +75,8 @@ struct {
>> > int exiting;
>> > /* multifd ops */
>> > const MultiFDMethods *ops;
>> > + /* number of channels created (fixed at setup) */
>> > + int channel_num;
>>
>> Reads like "channel number" to me. As in "the number of the
>> channel". I'd use n_channels, num_channels or channels_num.
>>
>> Naming aside... we'll then have three variables representing number of
>> multifd channels:
>>
>> s->parameters.multifd_channels
>> multifd_send_state->channel_num
>> multifd_recv_state->channel_num
>>
>> (or just 2 and inconsistent representation between send/recv, which is
>> worse IMO)
>>
Looking again at this argument I put (too many variables), I notice we
also have multifd_send_state->channels_ready and
multifd_recv_state->count, both of which should contain the right number
of channels after multifd_send_setup() and migration_start_incoming(),
respectively. Maybe we could unify all of this into a single semaphore
used on both sides and take the semaphore count as the number of
channels. @Peter, do you think it's worth it?
>> Let's go back to the core issue I described at the start, could we
>> instead check at migrate_params_test_apply() whether migration is
>> running and return an error when trying to change multifd channels?
>>
>> There might be issues with current_migration going away while QMP is
>> still dispatching, but I'm not sure it will be productive if we start to
>> solve locally the troubles caused by each parameter when changed at
>> migration runtime.
>
> Agreed, I think we should fix it with the whitelist idea.
>
> If you haven't started looking at this (??), I wonder if Trieu would like
> to look at it as a replacement of this patch.
I haven't started. Feel free to take.
* Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
From: Peter Xu @ 2026-04-23 18:44 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Trieu Huynh, qemu-devel
On Thu, Apr 23, 2026 at 03:13:42PM -0300, Fabiano Rosas wrote:
> Looking again at this argument I put (too many variables), I notice we
> also have multifd_send_state->channels_ready and
channels_ready seems to be special? In busy systems I think it should
normally always be less than the number of threads on sender side, because
some of them will be busy.
> multifd_recv_state->count, both of which should contain the right number
This one is used so far just to count how many channels are established on
the dest side. Logically it can be gone after the setup phase; it just
doesn't hurt to keep it available until cleanup.
> of channels after multifd_send_setup() and migration_start_incoming(),
> respectively. Maybe we could unify all of this into a single semaphore
> used in both sides and take the semaphore count as number of
> channels. @Peter, do you think it's worth it?
My memory is that reading the counter from a sem isn't always reliable. I
only ever use sems as thread-safety primitives.
So if I was right about channels_ready above, the merge idea may not
apply. We're essentially trying to save 8B for ->count.. and it can only
go away after it has finished its use (IOW, we still need it to work
before recv side setup).
Just leave them as-is?
--
Peter Xu
* Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
From: Fabiano Rosas @ 2026-04-23 19:41 UTC (permalink / raw)
To: Peter Xu; +Cc: Trieu Huynh, qemu-devel
Peter Xu <peterx@redhat.com> writes:
> On Thu, Apr 23, 2026 at 03:13:42PM -0300, Fabiano Rosas wrote:
>> Looking again at this argument I put (too many variables), I notice we
>> also have multifd_send_state->channels_ready and
>
> channels_ready seems to be special? In busy systems I think it should
> normally always be less than the number of threads on sender side, because
> some of them will be busy.
>
Argh, sorry, I meant channels_created!
>> multifd_recv_state->count, both of which should contain the right number
>
> This one is used so far to just count how many channels are established on
> dest side. Logically after setup phase it can be gone. It's just not hurt
> to be available until cleanups.
>
>> of channels after multifd_send_setup() and migration_start_incoming(),
>> respectively. Maybe we could unify all of this into a single semaphore
>> used in both sides and take the semaphore count as number of
>> channels. @Peter, do you think it's worth it?
>
> My memory is reading counter from a sem isn't always reliable. I always
> only use sem as thread-safety purpose primitives.
>
> So if I was correct on channels_ready above, then it may not apply on the
> idea to merge. We're essentially trying to save one 8B for ->count.. and
> it can only be gone after it's finished its use (IOW, we still need it to
> work before recv side setup).
>
> Just leave them as-is?
* Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
From: Peter Xu @ 2026-04-24 13:53 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Trieu Huynh, qemu-devel
On Thu, Apr 23, 2026 at 04:41:39PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Apr 23, 2026 at 03:13:42PM -0300, Fabiano Rosas wrote:
> >> Looking again at this argument I put (too many variables), I notice we
> >> also have multifd_send_state->channels_ready and
> >
> > channels_ready seems to be special? In busy systems I think it should
> > normally always be less than the number of threads on sender side, because
> > some of them will be busy.
> >
>
> Argh, sorry, I meant channels_created!
Yeah, this one looks like a slight duplicate of channels_ready. However
it's still slightly different in that it's also used in the failure path
of channel establishment.
IOW, multifd_send_channel_created() can be invoked in failure paths where
multifd_channel_connect() won't.
We could still consider reusing channels_ready, but it will introduce a few
complexities, namely:
- The name becomes slightly ambiguous: we may need to listen to
channels_ready even if the channel creation failed.. though to be fair,
channels_created also implies a success.. so I assume not a major concern.
- Multifd sender side relies on channels_ready to be posted by default when
migration just starts (says, "all channels are free to use"). It means
if we consume that sem here waiting for channels, then we need to kick
all threads once more just to give it back, hence one more roundtrip of
sem notifies. We also need a small touchup in send thread to allow that
to happen; patch attached at the end to show what I mean (not tested at
all, please treat it as pseudo code.. so it's definitely not a complete
patch).
I think we can still leave it there to make the establishment path simple.
What's your take?
===8<===
diff --git a/migration/multifd.c b/migration/multifd.c
index 035cb70f7b..570ff8c017 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -736,16 +736,9 @@ static void *multifd_send_thread(void *opaque)
* multifd_send().
*/
qatomic_store_release(&p->pending_job, false);
- } else {
+ } else if (qatomic_read(&p->pending_sync)) {
MultiFDSyncReq req = qatomic_read(&p->pending_sync);
- /*
- * If not a normal job, must be a sync request. Note that
- * pending_sync is a standalone flag (unlike pending_job), so
- * it doesn't require explicit memory barriers.
- */
- assert(req != MULTIFD_SYNC_NONE);
-
/* Only push the SYNC message if it involves a remote sync */
if (req == MULTIFD_SYNC_ALL) {
p->flags = MULTIFD_FLAG_SYNC;
@@ -964,7 +957,14 @@ bool multifd_send_setup(void)
* past this point.
*/
for (i = 0; i < thread_count; i++) {
- qemu_sem_wait(&multifd_send_state->channels_created);
+ MultiFDSendParams *p = &multifd_send_state->params[i];
+
+ qemu_sem_wait(&multifd_send_state->channels_ready);
+ /*
+ * Re-kick the thread to recover the channels_ready event we
+ * consumed for detecting channel establish event.
+ */
+ qemu_sem_post(&p->sem);
}
if (ret) {
--
Peter Xu
* Re: [PATCH 1/1] migration/multifd: fix channel count TOCTOU race on cancel and retry
0 siblings, 0 replies; 9+ messages in thread
From: Fabiano Rosas @ 2026-04-24 14:15 UTC (permalink / raw)
To: Peter Xu; +Cc: Trieu Huynh, qemu-devel
Peter Xu <peterx@redhat.com> writes:
> On Thu, Apr 23, 2026 at 04:41:39PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Apr 23, 2026 at 03:13:42PM -0300, Fabiano Rosas wrote:
>> >> Looking again at this argument I put (too many variables), I notice we
>> >> also have multifd_send_state->channels_ready and
>> >
>> > channels_ready seems to be special? In busy systems I think it should
>> > normally always be less than the number of threads on sender side, because
>> > some of them will be busy.
>> >
>>
>> Argh, sorry, I meant channels_created!
>
> Yeah, this one looks like a slight duplicate over channels_ready. However
> it's still slightly different in that it's also used in failure path of
> channel establishments.
>
I'm not suggesting to merge channels_ready and channels_created. I'm
saying that channels_created == migrate_multifd_channels().
> IOW, multifd_send_channel_created() can be invoked in failure paths where
> multifd_channel_connect() won't.
>
Good, we always want to iterate over the number of created channels,
right? But maybe using the semaphore count is really not a good idea,
let's leave it.
> We could still consider reusing channels_ready, but it will introduce a few
> complexities, namely:
>
> - The name becomes slightly ambiguous: we may need to listen to
> channels_ready even if the channel creation failed.. if to be fair,
> channels_created also implies a success.. so I assume not a major concern.
>
> - Multifd sender side relies on channels_ready to be posted by default when
> migration just starts (says, "all channels are free to use"). It means
> if we consume that sem here waiting for channels, then we need to kick
> all threads once more just to give it back, hence one more roundtrip of
> sem notifies. We also need a small touchup in send thread to allow that
> to happen; patch attached at the end to show what I mean (not tested at
> all, please treat it as pesudo code.. so it's definitely not a complete
> patch).
>
> I think we can still leave it there to make the establishment path simple.
>
> What's your take?
>
> ===8<===
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 035cb70f7b..570ff8c017 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -736,16 +736,9 @@ static void *multifd_send_thread(void *opaque)
> * multifd_send().
> */
> qatomic_store_release(&p->pending_job, false);
> - } else {
> + } else if (qatomic_read(&p->pending_sync)) {
> MultiFDSyncReq req = qatomic_read(&p->pending_sync);
>
> - /*
> - * If not a normal job, must be a sync request. Note that
> - * pending_sync is a standalone flag (unlike pending_job), so
> - * it doesn't require explicit memory barriers.
> - */
> - assert(req != MULTIFD_SYNC_NONE);
> -
> /* Only push the SYNC message if it involves a remote sync */
> if (req == MULTIFD_SYNC_ALL) {
> p->flags = MULTIFD_FLAG_SYNC;
> @@ -964,7 +957,14 @@ bool multifd_send_setup(void)
> * past this point.
> */
> for (i = 0; i < thread_count; i++) {
> - qemu_sem_wait(&multifd_send_state->channels_created);
> + MultiFDSendParams *p = &multifd_send_state->params[i];
> +
> + qemu_sem_wait(&multifd_send_state->channels_ready);
> + /*
> + * Re-kick the thread to recover the channels_ready event we
> + * consumed for detecting channel establish event.
> + */
> + qemu_sem_post(&p->sem);
> }
>
> if (ret) {
I think it's risky.