From: Li Zhang <lizhang@suse.de>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
"Daniel P. Berrangé" <berrange@redhat.com>
Cc: qemu-devel@nongnu.org, cfontana@suse.de, quintela@redhat.com
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Mon, 6 Dec 2021 10:28:33 +0100
Message-ID: <f4a3b973-b968-a40b-0699-0c6c0f79b1a6@suse.de>
In-Reply-To: <YaT2nMsL18cZxPgk@work-vm>
On 11/29/21 4:49 PM, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>> On Mon, Nov 29, 2021 at 11:20:08AM +0000, Dr. David Alan Gilbert wrote:
>>> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>>>> On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
>>>>> When doing live migration with 8, 16, or more multifd channels,
>>>>> the guest hangs in the presence of network errors such as missing TCP ACKs.
>>>>>
>>>>> At the sender's side:
>>>>> The main thread is blocked on qemu_thread_join. migration_fd_cleanup
>>>>> is called because one thread fails on qio_channel_write_all when
>>>>> the network problem happens, while the other send threads are blocked on
>>>>> sendmsg and cannot be terminated. So the main thread stays blocked on
>>>>> qemu_thread_join, waiting for those threads to terminate.
>>>> Isn't the right answer here to ensure we've called 'shutdown' on
>>>> all the FDs, so that the threads get kicked out of sendmsg, before
>>>> trying to join the thread ?
>>> I agree a timeout is wrong here; there is no way to get a good timeout
>>> value.
>>> However, I'm a bit confused: we should be able to try a shutdown on the
>>> receive side using the 'yank' command; that's what it's there for. Li,
>>> does this solve your problem?
>> Why do we even need to use 'yank' on the receive side ? Until migration
>> has switched over from src to dst, the receive side is discardable and
>> the whole process can just be terminated with kill(SIGTERM/SIGKILL).
> True, although it's nice to be able to quit cleanly.
I found that the 'yank' function is in fact already registered on the
receive side.
The registration is done differently from the send side.
It happens in this function:
void migration_channel_process_incoming(QIOChannel *ioc)
{
    MigrationState *s = migrate_get_current();
    Error *local_err = NULL;

    trace_migration_set_incoming_channel(
        ioc, object_get_typename(OBJECT(ioc)));

    if (s->parameters.tls_creds &&
        *s->parameters.tls_creds &&
        !object_dynamic_cast(OBJECT(ioc),
                             TYPE_QIO_CHANNEL_TLS)) {
        migration_tls_channel_process_incoming(s, ioc, &local_err);
    } else {
        migration_ioc_register_yank(ioc);
        migration_ioc_process_incoming(ioc, &local_err);
    }

    if (local_err) {
        error_report_err(local_err);
    }
}
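
To make the "shutdown before join" idea discussed above concrete, here is
a minimal standalone sketch. It is not QEMU code: it assumes plain POSIX
threads and an AF_UNIX socketpair as stand-ins for a multifd thread and
its channel, and the names recv_worker and shutdown_then_join_demo are
hypothetical. It only illustrates that shutdown() on the fd makes a
blocked recv() return, so that the following join can complete.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical per-thread state; not a QEMU structure. */
struct worker {
    int fd;          /* socket the thread blocks on */
    ssize_t result;  /* return value of recv() */
};

/* Stand-in for a receive thread: blocks in recv() on its channel. */
static void *recv_worker(void *opaque)
{
    struct worker *w = opaque;
    char buf[64];

    assert(w != NULL);
    /* Blocks until data arrives, the peer closes, or the fd is shut down. */
    w->result = recv(w->fd, buf, sizeof(buf), 0);
    return NULL;
}

/*
 * Returns 0 if the worker was released by shutdown() and joined,
 * -1 on any failure.
 */
int shutdown_then_join_demo(void)
{
    int sv[2];
    pthread_t tid;
    struct worker w;
    struct timespec delay = { 0, 100 * 1000 * 1000 }; /* 100 ms */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    w.fd = sv[0];
    w.result = -1;

    if (pthread_create(&tid, NULL, recv_worker, &w) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Give the worker time to block; nothing is ever written to sv[1]. */
    nanosleep(&delay, NULL);

    /* Kick the worker out of recv() first, and only then join it. */
    shutdown(sv[0], SHUT_RDWR);
    pthread_join(tid, NULL);

    close(sv[0]);
    close(sv[1]);

    /* After shutdown(), recv() reports end-of-stream (0). */
    return w.result == 0 ? 0 : -1;
}
```

Calling shutdown_then_join_demo() from a small main() should return 0.
Without the shutdown() call, pthread_join() would wait forever, because
no data is ever sent on the other end of the socketpair.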
>
>> On the source side 'yank' is needed, because the QEMU process is still
>> running the live workload and thus is precious and mustn't be killed.
> True.
>
> Dave
>
>> Regards,
>> Daniel
>> --
>> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
>> |: https://libvirt.org -o- https://fstop138.berrange.com :|
>> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>>