qemu-devel.nongnu.org archive mirror
From: Li Zhang <lizhang@suse.de>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: quintela@redhat.com, "Daniel P. Berrangé" <berrange@redhat.com>,
	qemu-devel@nongnu.org, cfontana@suse.de
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Mon, 29 Nov 2021 16:34:42 +0100
Message-ID: <97c18ffc-2ce9-145f-6f86-d13fe32e5162@suse.de>
In-Reply-To: <YaTorUbhzjFhvBl5@work-vm>


On 11/29/21 3:50 PM, Dr. David Alan Gilbert wrote:
> * Li Zhang (lizhang@suse.de) wrote:
>> On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
>>> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>>>> On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
>>>>> When doing live migration with multifd channels 8, 16 or larger number,
>>>>> the guest hangs in the presence of the network errors such as missing TCP ACKs.
>>>>>
>>>>> At sender's side:
>>>>> The main thread is blocked on qemu_thread_join, migration_fd_cleanup
>>>>> is called because one thread fails on qio_channel_write_all when
>>>>> the network problem happens and other send threads are blocked on sendmsg.
>>>>> They could not be terminated. So the main thread is blocked on qemu_thread_join
>>>>> to wait for the threads terminated.
>>>> Isn't the right answer here to ensure we've called 'shutdown' on
>>>> all the FDs, so that the threads get kicked out of sendmsg, before
>>>> trying to join the thread ?
>>> I agree a timeout is wrong here; there is no way to get a good timeout
>>> value.
>>> However, I'm a bit confused - we should be able to try a shutdown on the
>>> receive side using the 'yank' command. - that's what it's there for; Li
>>> does this solve your problem?
>> No, I tried to register 'yank' on the receive side, the receive threads are
>> still waiting there.
>>
>> It seems that on send side, 'yank' doesn't work either when the send threads
>> are blocked.
>>
>> This may be not the case to call yank. I am not quite sure about it.
> We need to fix that; 'yank' should be able to recover from any network
> issue.  If it's not working we need to understand why.

OK, I will look into it.
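
For reference, the shutdown-before-join idea suggested above can be
sketched outside QEMU as follows. This is only a minimal illustration,
not the migration code: blocked_reader() and shutdown_kicks_reader() are
hypothetical names, a local socketpair stands in for the migration
channel, and it assumes Linux-like behaviour where shutdown() wakes a
thread blocked in recv() on the same socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical worker: blocks in recv(), like an I/O thread whose peer went silent. */
void *blocked_reader(void *opaque)
{
    int fd = *(int *)opaque;
    char buf[64];

    assert(fd >= 0);
    /* Blocks until data arrives, the peer closes, or the socket is shut down. */
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    return (void *)(intptr_t)n;
}

/* Returns 0 if shutdown() kicked the reader out of recv(), -1 otherwise. */
int shutdown_kicks_reader(void)
{
    int sv[2];
    pthread_t tid;
    void *ret = NULL;
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pthread_create(&tid, NULL, blocked_reader, &sv[0]) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    nanosleep(&delay, NULL);       /* give the reader time to block (demo only) */
    shutdown(sv[0], SHUT_RDWR);    /* wake the reader before trying to join it */
    pthread_join(tid, &ret);       /* returns once recv() has been interrupted */

    close(sv[0]);
    close(sv[1]);
    /* recv() reports EOF (0) or an error (-1) after the shutdown. */
    return (intptr_t)ret <= 0 ? 0 : -1;
}
```

Calling shutdown_kicks_reader() from a main() should return 0 without
hanging. If the shutdown() line is removed, pthread_join() waits forever,
which is the same shape as the hang on qemu_thread_join described above.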

>
>>> multifd_load_cleanup already kicks sem_sync before trying to do a
>>> thread_join - so have we managed to trigger that on the receive side?
>> There is no problem with sem_sync in function multifd_load_cleanup.
>>
>> But it is not called in my case, because no errors are detected on the
>> receive side.
> If you're getting TCP errors why aren't you seeing any errors on the
> receive side?

That's a good point. I need to find out why.

>
>> The problem is here:
>>
>> void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
>> {
>>     MigrationIncomingState *mis = migration_incoming_get_current();
>>     Error *local_err = NULL;
>>     bool start_migration;
>>
>>     ...
>>
>>     if (!mis->from_src_file) {
>>
>>     ...
>>
>>     } else {
>>         /* Multiple connections */
>>         assert(migrate_use_multifd());
>>         start_migration = multifd_recv_new_channel(ioc, &local_err);
>>         if (local_err) {
>>             error_propagate(errp, local_err);
>>             return;
>>         }
>>     }
>>     if (start_migration) {
>>         migration_incoming_process();
>>     }
>> }
>>
>> start_migration is always 0, and migration is not started because some
>> receive threads are not created.
>>
>> No errors are detected here and the main process works well but receive
>> threads are all waiting for semaphore.
>>
>> It's hard to know if the receive threads are not created. If we can find a
>> way to check if any receive threads
> So is this only a problem for network issues that happen during startup,
> before all the threads have been created?

Yes, it is.
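
To make the startup case concrete, here is a minimal sketch of the
counting logic as I understand it. RecvState, recv_new_channel() and
missing_channels() are hypothetical names for illustration only, not the
real multifd structures: the point is just that the "start" condition
stays false, without any error, as long as some expected channels never
connect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the receive-side channel bookkeeping. */
typedef struct {
    int expected;    /* number of channels the destination waits for */
    int connected;   /* number of channels accepted so far */
} RecvState;

/* Called once per accepted connection; true only when all channels are in. */
bool recv_new_channel(RecvState *s)
{
    assert(s->connected < s->expected);
    s->connected++;
    return s->connected == s->expected;
}

/* Number of channels that have not connected yet (0 means ready to start). */
int missing_channels(const RecvState *s)
{
    return s->expected - s->connected;
}
```

With expected = 8 and only 6 connections accepted, recv_new_channel()
never returns true and missing_channels() stays at 2, so nothing starts
the incoming migration and nothing reports a failure either.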

>
> Dave
>
>> are not created, we can kick the sem_sync and do cleanup.
>>
>> From the source code, the thread will be created when the QIO channel
>> detects something via the GIO watch, if I understand correctly.
>>
>> If nothing is detected, socket_accept_incoming_migration won't be called and
>> the thread will not be created.
>>
>> socket_start_incoming_migration_internal ->
>>
>>     qio_net_listener_set_client_func_full(listener,
>>                                           socket_accept_incoming_migration,
>>                                           NULL, NULL,
>>                                           g_main_context_get_thread_default());
>>
>>     qio_net_listener_set_client_func_full ->
>>
>>         qio_channel_add_watch_source(
>>             QIO_CHANNEL(listener->sioc[i]), G_IO_IN,
>>             qio_net_listener_channel_func,
>>             listener, (GDestroyNotify)object_unref, context);
>>
>>     socket_accept_incoming_migration ->
>>
>>         migration_channel_process_incoming ->
>>
>>             migration_ioc_process_incoming ->
>>
>>                 multifd_recv_new_channel ->
>>
>>                     qemu_thread_create(&p->thread, p->name,
>>                                        multifd_recv_thread, p,
>>                                        QEMU_THREAD_JOINABLE);
>>
>>> Dave
>>>
>>>> Regards,
>>>> Daniel
>>>> -- 
>>>> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>>>> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>>>> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>>>>

