From: Li Zhang <lizhang@suse.de>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: qemu-devel@nongnu.org, cfontana@suse.de,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
quintela@redhat.com
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Wed, 1 Dec 2021 15:15:15 +0100
Message-ID: <11b18dc6-b850-0ff7-e033-06593635f5b2@suse.de>
In-Reply-To: <YaeCNK60ziPm9p4N@redhat.com>
On 12/1/21 3:09 PM, Daniel P. Berrangé wrote:
> On Wed, Dec 01, 2021 at 02:42:04PM +0100, Li Zhang wrote:
>> On 12/1/21 1:22 PM, Daniel P. Berrangé wrote:
>>> On Wed, Dec 01, 2021 at 01:11:13PM +0100, Li Zhang wrote:
>>>> On 11/29/21 3:50 PM, Dr. David Alan Gilbert wrote:
>>>>> * Li Zhang (lizhang@suse.de) wrote:
>>>>>> On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
>>>>>>> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>>>>>>>> On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
>>>>>>>>> When doing live migration with 8, 16 or more multifd channels,
>>>>>>>>> the guest hangs in the presence of network errors such as missing TCP ACKs.
>>>>>>>>>
>>>>>>>>> At sender's side:
>>>>>>>>> The main thread is blocked in qemu_thread_join. migration_fd_cleanup
>>>>>>>>> is called because one thread fails in qio_channel_write_all when
>>>>>>>>> the network problem happens, while the other send threads are blocked
>>>>>>>>> in sendmsg and cannot be terminated. So the main thread blocks in
>>>>>>>>> qemu_thread_join, waiting for those threads to terminate.
>>>>>>>> Isn't the right answer here to ensure we've called 'shutdown' on
>>>>>>>> all the FDs, so that the threads get kicked out of sendmsg, before
>>>>>>>> trying to join the thread ?
>>>>>>> I agree a timeout is wrong here; there is no way to get a good timeout
>>>>>>> value.
>>>>>>> However, I'm a bit confused - we should be able to try a shutdown on the
>>>>>>> receive side using the 'yank' command. - that's what it's there for; Li
>>>>>>> does this solve your problem?
>>>>>> No, I tried to register 'yank' on the receive side, the receive threads are
>>>>>> still waiting there.
>>>>>>
>>>>>> It seems that on send side, 'yank' doesn't work either when the send threads
>>>>>> are blocked.
>>>>>>
>>>>>> This may not be a case where yank applies. I am not quite sure about it.
>>>>> We need to fix that; 'yank' should be able to recover from any network
>>>>> issue. If it's not working we need to understand why.
>>>> Hi Dr. David,
>>>>
>>>> On the receive side, I register 'yank' and it is called. But it only
>>>> shuts down the channels; it doesn't fix the problem of the receive
>>>> threads that are waiting for the semaphore, so the receive threads are
>>>> still waiting there.
>>>>
>>>> On the send side, the main thread is blocked in qemu_thread_join(). When I
>>>> tried the 'yank' command over QMP, it was not handled. So QMP doesn't work
>>>> and yank doesn't work.
>>> IOW, there is a bug in QEMU on the send side. It should not be calling
>>> qemu_thread_join() from the main thread, unless it is extremely
>>> confident that the thread in question has already finished.
>>>
>>> You seem to be showing that the thread(s) are still running, so we
>>> need to understand why that is the case, and why the main thread
>>> still decided to try to join these threads which haven't finished.
>> Some threads are running, but one thread fails in
>> qio_channel_write_all.
>>
>> In migration_thread(), it detects an error here:
>>
>>     thr_error = migration_detect_error(s);
>>     if (thr_error == MIG_THR_ERR_FATAL) {
>>         /* Stop migration */
>>         break;
>>     }
>>
>> It stops the migration and cleans up.
> Those threads which are still running need to be made to
> terminate before trying to join them.
>
> A quick glance at multifd_send_terminate_threads() makes me
> suspect multifd shutdown is not reliable.
>
> It is merely setting some boolean flags and posting to a
> semaphore. It is doing nothing to shut down the socket
> associated with each thread, so the threads can still be
> waiting in an I/O call. IMHO multifd_send_terminate_threads
> needs to call qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH).
Agree with you.
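
To illustrate the point, here is a minimal standalone sketch. It uses plain
POSIX sockets and pthreads, not QEMU code, and the names blocked_reader and
shutdown_then_join_demo are made up for this example. A thread blocked in
recv() returns once shutdown() is called on its socket, so the join that
follows does not hang. It assumes Linux behaviour for shutdown() on an
AF_UNIX socketpair:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical worker: blocks in recv() because the peer never sends. */
static void *blocked_reader(void *opaque)
{
    int fd = *(int *)opaque;
    char buf[64];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);

    /* After shutdown() on fd, recv() reports end-of-stream (0). */
    return (void *)(intptr_t)n;
}

/* Returns 0 if the reader was kicked out of recv() and joined cleanly. */
int shutdown_then_join_demo(void)
{
    int sv[2];
    pthread_t tid;
    void *ret = NULL;
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pthread_create(&tid, NULL, blocked_reader, &sv[0]) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Give the reader time to block; the result is the same if it has not. */
    nanosleep(&delay, NULL);

    /* Kick the thread out of its I/O call first, and only then join it. */
    shutdown(sv[0], SHUT_RDWR);
    pthread_join(tid, &ret);

    close(sv[0]);
    close(sv[1]);
    return (intptr_t)ret == 0 ? 0 : -1;
}
```

Calling shutdown_then_join_demo() should return 0. In QEMU the corresponding
step would presumably be qio_channel_shutdown() on each multifd channel
before qemu_thread_join().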
>
> Regards,
> Daniel
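
On the receive side the threads wait on a semaphore rather than in an I/O
call, so shutting down the channel alone does not wake them. A rough sketch
of waking such a waiter without a timeout, assuming a quit flag that is set
before the semaphore is posted. This uses plain POSIX semaphores, not QEMU's
qemu_sem_* API, and Waiter, waiter_thread, waiter_terminate and
sem_terminate_demo are made-up names:

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-thread state: a semaphore plus a quit flag. */
typedef struct {
    sem_t sem;
    atomic_bool quit;
} Waiter;

static void *waiter_thread(void *opaque)
{
    Waiter *w = opaque;

    for (;;) {
        /* Untimed wait; retry only if interrupted by a signal. */
        while (sem_wait(&w->sem) != 0 && errno == EINTR) {
        }
        if (atomic_load(&w->quit)) {
            break;
        }
        /* ... otherwise handle one unit of work here ... */
    }
    return NULL;
}

/* Set the flag first, then post, so the woken thread sees quit == true. */
static void waiter_terminate(Waiter *w)
{
    atomic_store(&w->quit, true);
    sem_post(&w->sem);
}

/* Returns 0 once the waiter has been woken and joined. */
int sem_terminate_demo(void)
{
    Waiter w;
    pthread_t tid;

    if (sem_init(&w.sem, 0, 0) != 0) {
        return -1;
    }
    atomic_init(&w.quit, false);
    if (pthread_create(&tid, NULL, waiter_thread, &w) != 0) {
        sem_destroy(&w.sem);
        return -1;
    }

    waiter_terminate(&w);
    pthread_join(tid, NULL);
    sem_destroy(&w.sem);
    return 0;
}
```

Whether this maps cleanly onto the multifd receive path is an open question;
the sketch only shows that a post paired with a flag wakes the waiter
without needing a timeout value.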