From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Li Zhang <lizhang@suse.de>
Cc: quintela@redhat.com, qemu-devel@nongnu.org, dgilbert@redhat.com,
	cfontana@suse.de
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Fri, 26 Nov 2021 17:13:28 +0000	[thread overview]
Message-ID: <YaEVuHJuWcwXISTI@redhat.com> (raw)
In-Reply-To: <3a976c2c-3f0e-9be7-b845-77144ce8c275@suse.de>

On Fri, Nov 26, 2021 at 06:00:24PM +0100, Li Zhang wrote:
> 
> On 11/26/21 5:51 PM, Daniel P. Berrangé wrote:
> > On Fri, Nov 26, 2021 at 05:44:04PM +0100, Li Zhang wrote:
> > > On 11/26/21 4:49 PM, Daniel P. Berrangé wrote:
> > > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > > When doing live migration with 8, 16 or more multifd channels,
> > > > > the guest hangs in the presence of network errors such as missing TCP ACKs.
> > > > > 
> > > > > At the sender's side:
> > > > > migration_fd_cleanup is called because one thread fails in
> > > > > qio_channel_write_all when the network problem happens, while the other
> > > > > send threads are blocked on sendmsg and cannot be terminated. The main
> > > > > thread is therefore blocked on qemu_thread_join, waiting for those
> > > > > threads to terminate.
> > > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > > trying to join the thread ?
> > > If we shut down the channels at the sender's side, that would terminate
> > > the send threads, but the receive threads would still be waiting.
> > > 
> > > From the receiver's side, if the semaphore wait times out, the channels
> > > can eventually be terminated, and the sender threads then terminate as well.
> > If something goes wrong on the sender side, the mgmt app should be
> > tearing down the destination QEMU entirely, so I'm not sure we need
> > to do anything special to deal with receive threads.
> > 
> > Using sem_timedwait just feels risky because it will introduce false
> > failures if the system/network is under high load such that the
> > connections don't all establish within 1 second.
> 
> You are right, this may be a risk. I am not sure the interval is
> appropriate; we could make it longer.
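
For illustration, the sender-side ordering suggested in the quoted
exchange above might look roughly like the sketch below. The SendChannel
struct and the multifd_send_kick_all() helper are hypothetical names
invented for this example, not QEMU's actual multifd code; only
qio_channel_shutdown() and qemu_thread_join() are real QEMU APIs. The
point is the ordering: shutting each channel down first forces a thread
stuck in sendmsg() to return with an error, so the subsequent join
cannot hang.

  /* Illustrative sketch, not QEMU's real multifd code. */
  #include "qemu/osdep.h"
  #include "qemu/thread.h"
  #include "io/channel.h"

  typedef struct SendChannel {
      QIOChannel *ioc;    /* channel the send thread writes to */
      QemuThread thread;  /* the per-channel send thread */
  } SendChannel;

  static void multifd_send_kick_all(SendChannel *channels, int nchannels)
  {
      /* First pass: force every blocked sendmsg() to fail ... */
      for (int i = 0; i < nchannels; i++) {
          qio_channel_shutdown(channels[i].ioc,
                               QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
      }

      /* ... then joining the threads cannot block forever. */
      for (int i = 0; i < nchannels; i++) {
          qemu_thread_join(&channels[i].thread);
      }
  }
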

I don't think any kind of timeout is right in this context. There should
be a sem_post() invoked in every scenario where we want to tear down the
recv thread. That should only be the case when we see the other end of
the connection close IMHO.
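
As a minimal sketch of that pattern (hypothetical names such as
RecvChannel and multifd_recv_mark_closed(), not QEMU's actual multifd
implementation): the recv thread blocks on a plain qemu_sem_wait(), and
every teardown path that observes the peer closing the connection sets a
quit flag and posts the semaphore, so no timeout (and no spurious
failure under heavy load) is ever needed.

  /* Illustrative sketch, not QEMU's real multifd code. */
  #include "qemu/osdep.h"
  #include "qemu/thread.h"

  typedef struct RecvChannel {
      QemuSemaphore sem;  /* posted when data is ready or we are quitting */
      QemuMutex mutex;    /* protects quit */
      bool quit;          /* set when tearing the channel down */
  } RecvChannel;

  /* Called from whichever path notices the connection has closed. */
  static void multifd_recv_mark_closed(RecvChannel *c)
  {
      qemu_mutex_lock(&c->mutex);
      c->quit = true;
      qemu_mutex_unlock(&c->mutex);
      qemu_sem_post(&c->sem);   /* wake the recv thread unconditionally */
  }

  static void *recv_thread(void *opaque)
  {
      RecvChannel *c = opaque;

      for (;;) {
          /* Plain wait, no timeout: slow connection setup under heavy
           * load cannot spuriously abort the migration. */
          qemu_sem_wait(&c->sem);

          qemu_mutex_lock(&c->mutex);
          bool quit = c->quit;
          qemu_mutex_unlock(&c->mutex);
          if (quit) {
              break;
          }
          /* ... otherwise process the incoming packet for this channel ... */
      }
      return NULL;
  }
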

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



Thread overview: 25+ messages
2021-11-26 15:31 [PATCH 0/2] migration: multifd live migration improvement Li Zhang
2021-11-26 15:31 ` [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever Li Zhang
2021-11-26 15:49   ` Daniel P. Berrangé
2021-11-26 16:44     ` Li Zhang
2021-11-26 16:51       ` Daniel P. Berrangé
2021-11-26 17:00         ` Li Zhang
2021-11-26 17:13           ` Daniel P. Berrangé [this message]
2021-11-26 17:44             ` Li Zhang
2021-11-29 11:20     ` Dr. David Alan Gilbert
2021-11-29 13:37       ` Li Zhang
2021-11-29 14:50         ` Dr. David Alan Gilbert
2021-11-29 15:34           ` Li Zhang
2021-12-01 12:11           ` Li Zhang
2021-12-01 12:22             ` Daniel P. Berrangé
2021-12-01 13:42               ` Li Zhang
2021-12-01 14:09                 ` Daniel P. Berrangé
2021-12-01 14:15                   ` Li Zhang
2021-11-29 14:58       ` Daniel P. Berrangé
2021-11-29 15:49         ` Dr. David Alan Gilbert
2021-12-06  9:28           ` Li Zhang
2021-11-26 16:33   ` Juan Quintela
2021-11-26 16:56     ` Li Zhang
2021-11-26 15:31 ` [PATCH 2/2] migration: Set the socket backlog number to reduce the chance of live migration failure Li Zhang
2021-11-26 16:32   ` Juan Quintela
2021-11-26 16:44     ` Li Zhang
