From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, "Avihai Horon" <avihaih@nvidia.com>,
"Daniel P . Berrangé" <berrange@redhat.com>
Subject: Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races
Date: Wed, 07 Feb 2024 10:31:51 -0300 [thread overview]
Message-ID: <878r3w7648.fsf@suse.de> (raw)
In-Reply-To: <ZcLrF5HN920rUTSw@x1n>
Peter Xu <peterx@redhat.com> writes:
> On Tue, Feb 06, 2024 at 06:51:12PM -0300, Fabiano Rosas wrote:
>> Based-on: 20240202102857.110210-1-peterx@redhat.com
>> [PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups
>> https://lore.kernel.org/r/20240202102857.110210-1-peterx@redhat.com
>>
>> Hi,
>>
>> For v3 I fixed the refcounting issue spotted by Avihai. The situation
>> there is a bit clunky due to historical reasons. The gist is that we
>> have an assumption that channel creation never fails after p->c has
>> been set, so when 'p->c == NULL' we have to unref and when 'p->c !=
>> NULL' the cleanup code will do the unref.
>
> Yes, this looks good to me. That's a good catch.
>
> I'll leave at least one more day for Avihai and/or Dan to have another
> look. My r-b persist as of now on patch 5.
>
> Actually I think the conditional unref is slightly tricky, but it's not its
> own fault, IMHO, OTOH it's more about a1af605bd5ad where p->c is slightly
> abused. My understanding is we can avoid that conditional unref with below
> patch 1 as a cleanup (on top of this series). Then patch 2 comes all
> alongside.
Yes, I even wrote some code to always set p->c and leave the unref to
the cleanup routine. Doing reference counting in the middle of the code
like that leaves us exposed to new code relying on p->c being valid
during cleanup. There's already yank and the cleanup itself which expect
p->c to be valid.
However, I couldn't get past the uglyness of overwriting p->c, so I kept
the changes minimal for this version.
I'm also wondering whether we can remove the TLS handshake thread
altogether now that we moved the multifd_send_setup call into the
migration thread. My (poor) understanding is that a1af605bd5ad hit the
issue that the QIOTask completion would just execute after
multifd_save_setup returned. Otherwise I don't see how adding a thread
to an already async task would have helped. But I need to think about
that a bit more.
>
> We don't need to rush on these, though, we should fix the thread race
>first because multiple of us hit it, and all cleanups can be done
>later.
I said we should not merge these two changes right now, but I take that
back. I'll leave it up to you, there doesn't seem to be that much impact
in adding them.
>
> =====
> From 0830819d86e08c5175d6669505aa712a0a09717f Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Wed, 7 Feb 2024 10:08:35 +0800
> Subject: [PATCH 1/2] migration/multifd: Cleanup TLS iochannel referencing
>
> Commit a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to
> blocking handshake") introduced a thread for TLS channels, which will
> resolve the issue on blocking the main thread. However in the same commit
> p->c is slightly abused just to be able to pass over the pointer "p" into
> the thread.
>
> That's the major reason we'll need to conditionally free the io channel in
> the fault paths.
>
> To clean it up, using a separate structure to pass over both "p" and "tioc"
> in the tls handshake thread. Then we can make it a rule that p->c will
> never be set until the channel is completely setup. With that, we can drop
> the tricky conditional unref of the io channel in the error path.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/multifd.c | 37 +++++++++++++++++++++++--------------
> 1 file changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index adfe8c9a0a..4a85a6b7b3 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -873,16 +873,22 @@ out:
>
> static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
>
> +typedef struct {
> + MultiFDSendParams *p;
> + QIOChannelTLS *tioc;
> +} MultiFDTLSThreadArgs;
> +
> static void *multifd_tls_handshake_thread(void *opaque)
> {
> - MultiFDSendParams *p = opaque;
> - QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c);
> + MultiFDTLSThreadArgs *args = opaque;
>
> - qio_channel_tls_handshake(tioc,
> + qio_channel_tls_handshake(args->tioc,
> multifd_new_send_channel_async,
> - p,
> + args->p,
> NULL,
> NULL);
> + g_free(args);
> +
> return NULL;
> }
>
> @@ -892,6 +898,7 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
> {
> MigrationState *s = migrate_get_current();
> const char *hostname = s->hostname;
> + MultiFDTLSThreadArgs *args;
> QIOChannelTLS *tioc;
>
> tioc = migration_tls_client_create(ioc, hostname, errp);
> @@ -906,11 +913,14 @@ static bool multifd_tls_channel_connect(MultiFDSendParams *p,
> object_unref(OBJECT(ioc));
> trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
> qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
> - p->c = QIO_CHANNEL(tioc);
This assignment also meant multifd_send_channel_destroy() would call
object_unref on the tioc object. Removing it means
qio_channel_tls_finalize() will never be called.
It also means the socket channel (ioc) refcount will be decremented one
too many times, due to the object_unref above^.
So I think we should find a point where tioc is not needed anymore and
unref it and remove the object_unref(ioc) above.
Right?
> +
> + args = g_new0(MultiFDTLSThreadArgs, 1);
> + args->tioc = tioc;
> + args->p = p;
>
> p->tls_thread_created = true;
> qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
> - multifd_tls_handshake_thread, p,
> + multifd_tls_handshake_thread, args,
> QEMU_THREAD_JOINABLE);
> return true;
> }
> @@ -923,6 +933,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>
> migration_ioc_register_yank(ioc);
> p->registered_yank = true;
> + /* Setup p->c only if the channel is completely setup */
> p->c = ioc;
>
> p->thread_created = true;
> @@ -976,14 +987,12 @@ out:
>
> trace_multifd_new_send_channel_async_error(p->id, local_err);
> multifd_send_set_error(local_err);
> - if (!p->c) {
> - /*
> - * If no channel has been created, drop the initial
> - * reference. Otherwise cleanup happens at
> - * multifd_send_channel_destroy()
> - */
> - object_unref(OBJECT(ioc));
> - }
> + /*
> + * For error cases (TLS or non-TLS), IO channel is always freed here
> + * rather than when cleanup multifd: since p->c is not set, multifd
> + * cleanup code doesn't even know its existance.
> + */
> + object_unref(OBJECT(ioc));
> error_free(local_err);
> }
>
> --
> 2.43.0
> =====
> From 9e574c3216f6459e3a808096d905e2554d962cad Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Wed, 7 Feb 2024 10:24:39 +0800
> Subject: [PATCH 2/2] migration/multifd: Drop registered_yank
>
> With a clear definition of p->c protocol, where we only set it up if the
> channel is fully established (TLS or non-TLS), registered_yank boolean will
> have equal meaning of "p->c != NULL".
>
> Drop registered_yank by checking p->c instead.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
This one looks good. I know it depends on the previous patch, but if you
plan to add it:
Reviewed-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/multifd.h | 2 --
> migration/multifd.c | 7 +++----
> 2 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 8a1cad0996..b3fe27ae93 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -78,8 +78,6 @@ typedef struct {
> bool tls_thread_created;
> /* communication channel */
> QIOChannel *c;
> - /* is the yank function registered */
> - bool registered_yank;
> /* packet allocated len */
> uint32_t packet_len;
> /* guest page size */
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 4a85a6b7b3..278453cf84 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -648,11 +648,11 @@ static int multifd_send_channel_destroy(QIOChannel *send)
>
> static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
> {
> - if (p->registered_yank) {
> + if (p->c) {
> migration_ioc_unregister_yank(p->c);
> + multifd_send_channel_destroy(p->c);
> + p->c = NULL;
> }
> - multifd_send_channel_destroy(p->c);
> - p->c = NULL;
> qemu_sem_destroy(&p->sem);
> qemu_sem_destroy(&p->sem_sync);
> g_free(p->name);
> @@ -932,7 +932,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
> qio_channel_set_delay(ioc, false);
>
> migration_ioc_register_yank(ioc);
> - p->registered_yank = true;
> /* Setup p->c only if the channel is completely setup */
> p->c = ioc;
>
> --
> 2.43.0
> ====
>
> Thanks,
next prev parent reply other threads:[~2024-02-07 13:32 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-06 21:51 [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races Fabiano Rosas
2024-02-06 21:51 ` [PATCH v3 1/6] migration/multifd: Join the TLS thread Fabiano Rosas
2024-02-06 21:51 ` [PATCH v3 2/6] migration/multifd: Remove p->running Fabiano Rosas
2024-02-06 21:51 ` [PATCH v3 3/6] migration/multifd: Move multifd_send_setup error handling in to the function Fabiano Rosas
2024-02-06 21:51 ` [PATCH v3 4/6] migration/multifd: Move multifd_send_setup into migration thread Fabiano Rosas
2024-02-06 21:51 ` [PATCH v3 5/6] migration/multifd: Unify multifd and TLS connection paths Fabiano Rosas
2024-02-06 21:51 ` [PATCH v3 6/6] migration/multifd: Add a synchronization point for channel creation Fabiano Rosas
2024-02-07 2:29 ` [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races Peter Xu
2024-02-07 13:31 ` Fabiano Rosas [this message]
2024-02-08 3:32 ` Peter Xu
2024-02-08 3:01 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=878r3w7648.fsf@suse.de \
--to=farosas@suse.de \
--cc=avihaih@nvidia.com \
--cc=berrange@redhat.com \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.