All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Li Zhang <lizhang@suse.de>
Cc: quintela@redhat.com, "Daniel P. Berrangé" <berrange@redhat.com>,
	qemu-devel@nongnu.org, cfontana@suse.de
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Mon, 29 Nov 2021 14:50:21 +0000	[thread overview]
Message-ID: <YaTorUbhzjFhvBl5@work-vm> (raw)
In-Reply-To: <50dbb2b9-152e-f97a-d82e-0b6613e54085@suse.de>

* Li Zhang (lizhang@suse.de) wrote:
> 
> On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > When doing live migration with multifd channels 8, 16 or larger number,
> > > > the guest hangs in the presence of the network errors such as missing TCP ACKs.
> > > > 
> > > > At sender's side:
> > > > The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> > > > is called because one thread fails on qio_channel_write_all when
> > > > the network problem happens and other send threads are blocked on sendmsg.
> > > > They could not be terminated. So the main thread is blocked on qemu_thread_join
> > > > to wait for the threads terminated.
> > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > trying to join the thread ?
> > I agree a timeout is wrong here; there is no way to get a good timeout
> > value.
> > However, I'm a bit confused - we should be able to try a shutdown on the
> > receive side using the 'yank' command. - that's what it's there for; Li
> > does this solve your problem?
> 
> No, I tried to register 'yank' on the receive side, the receive threads are
> still waiting there.
> 
> It seems that on send side, 'yank' doesn't work either when the send threads
> are blocked.
> 
> This may be not the case to call yank. I am not quite sure about it.

We need to fix that; 'yank' should be able to recover from any network
issue.  If it's not working we need to understand why.

> > 
> > multifd_load_cleanup already kicks sem_sync before trying to do a
> > thread_join - so have we managed to trigger that on the receive side?
> 
> There is no problem with sem_sync in function multifd_load_cleanup.
> 
> But it is not called in my case, because no errors are detected on the
> receive side.

If you're getting TCP errors why aren't you seeing any errors on the
receive side?

> The problem is here:
> 
> void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
> {
>     MigrationIncomingState *mis = migration_incoming_get_current();
>     Error *local_err = NULL;
>     bool start_migration;
> 
>    ...
> 
>     if (!mis->from_src_file) {
> 
>     ...
> 
>      } else {
>         /* Multiple connections */
>         assert(migrate_use_multifd());
>         start_migration = multifd_recv_new_channel(ioc, &local_err);
>         if (local_err) {
>             error_propagate(errp, local_err);
>             return;
>         }
>     }
>    if (start_migration) {
>         migration_incoming_process();
>     }
> }
> 
> start_migration is always 0, and migration is not started because some
> receive threads are not created.
> 
> No errors are detected here and the main process works well but receive
> threads are all waiting for semaphore.
> 
> It's hard to know if the receive threads are not created. If we can find a
> way to check if any receive threads

So is this only a problem for network issues that happen during startup,
before all the threads have been created?

Dave

> are not created, we can kick the sem_sync and do cleanup.
> 
> From the source code, the thread will be created when QIO channel detects
> something by GIO watch if I understand correctly.
> 
> If nothing is detected, socket_accept_icoming_migration won't be called, the
> thread will not be created.
> 
> socket_start_incoming_migration_internal ->
> 
>     qio_net_listener_set_client_func_full(listener,
> socket_accept_incoming_migration,
>                                           NULL, NULL,
> g_main_context_get_thread_default());
> 
>    qio_net_listener_set_client_func_full ->
> 
>                qio_channel_add_watch_source(
>                 QIO_CHANNEL(listener->sioc[i]), G_IO_IN,
>                 qio_net_listener_channel_func,
>                 listener, (GDestroyNotify)object_unref, context);
> 
>   socket_accept_incoming_migration ->
> 
>        migration_channel_process_incoming ->
> 
>                migration_ioc_process_incoming ->
> 
>                      multifd_recv_new_channel ->
> 
>                             qemu_thread_create(&p->thread, p->name,
> multifd_recv_thread, p,
> QEMU_THREAD_JOINABLE);
> 
> > 
> > Dave
> > 
> > > Regards,
> > > Daniel
> > > -- 
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



  reply	other threads:[~2021-11-29 14:52 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-26 15:31 [PATCH 0/2] migration: multifd live migration improvement Li Zhang
2021-11-26 15:31 ` [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever Li Zhang
2021-11-26 15:49   ` Daniel P. Berrangé
2021-11-26 16:44     ` Li Zhang
2021-11-26 16:51       ` Daniel P. Berrangé
2021-11-26 17:00         ` Li Zhang
2021-11-26 17:13           ` Daniel P. Berrangé
2021-11-26 17:44             ` Li Zhang
2021-11-29 11:20     ` Dr. David Alan Gilbert
2021-11-29 13:37       ` Li Zhang
2021-11-29 14:50         ` Dr. David Alan Gilbert [this message]
2021-11-29 15:34           ` Li Zhang
2021-12-01 12:11           ` Li Zhang
2021-12-01 12:22             ` Daniel P. Berrangé
2021-12-01 13:42               ` Li Zhang
2021-12-01 14:09                 ` Daniel P. Berrangé
2021-12-01 14:15                   ` Li Zhang
2021-11-29 14:58       ` Daniel P. Berrangé
2021-11-29 15:49         ` Dr. David Alan Gilbert
2021-12-06  9:28           ` Li Zhang
2021-11-26 16:33   ` Juan Quintela
2021-11-26 16:56     ` Li Zhang
2021-11-26 15:31 ` [PATCH 2/2] migration: Set the socket backlog number to reduce the chance of live migration failure Li Zhang
2021-11-26 16:32   ` Juan Quintela
2021-11-26 16:44     ` Li Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YaTorUbhzjFhvBl5@work-vm \
    --to=dgilbert@redhat.com \
    --cc=berrange@redhat.com \
    --cc=cfontana@suse.de \
    --cc=lizhang@suse.de \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.