All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Wang Xin <wangxinxin.wang@huawei.com>,
	qemu-devel@nongnu.org, quintela@redhat.com,
	arei.gonglei@huawei.com
Subject: Re: [Qemu-devel] [PATCH] migration/fd: abort migration if receive POLLHUP event
Date: Wed, 25 Apr 2018 11:31:31 +0800	[thread overview]
Message-ID: <20180425033131.GI9036@xz-mi> (raw)
In-Reply-To: <20180425031423.GH9036@xz-mi>

On Wed, Apr 25, 2018 at 11:14:23AM +0800, Peter Xu wrote:
> On Tue, Apr 24, 2018 at 07:24:05PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 24, 2018 at 06:16:31PM +0100, Dr. David Alan Gilbert wrote:
> > > * Wang Xin (wangxinxin.wang@huawei.com) wrote:
> > > > If the fd socket peer closed shortly, ppoll may receive a POLLHUP
> > > > event before the expected POLLIN event, and qemu will do nothing
> > > > but goes into an infinite loop of the POLLHUP event.
> > > > 
> > > > So, abort the migration if we receive a POLLHUP event.
> > > 
> > > Hi Wang Xin,
> > >   Can you explain how you manage to trigger this case; I've not hit it.
> > > 
> > > > Signed-off-by: Wang Xin <wangxinxin.wang@huawei.com>
> > > > 
> > > > diff --git a/migration/fd.c b/migration/fd.c
> > > > index cd06182..5932c87 100644
> > > > --- a/migration/fd.c
> > > > +++ b/migration/fd.c
> > > > @@ -15,6 +15,7 @@
> > > >   */
> > > >  
> > > >  #include "qemu/osdep.h"
> > > > +#include "qemu/error-report.h"
> > > >  #include "channel.h"
> > > >  #include "fd.h"
> > > >  #include "monitor/monitor.h"
> > > > @@ -46,6 +47,11 @@ static gboolean fd_accept_incoming_migration(QIOChannel *ioc,
> > > >                                               GIOCondition condition,
> > > >                                               gpointer opaque)
> > > >  {
> > > > +    if (condition & G_IO_HUP) {
> > > > +        error_report("The migration peer closed, job abort");
> > > > +        exit(EXIT_FAILURE);
> > > > +    }
> > > > +
> > > 
> > > OK,  I wish we had a nicer way for failing;  especially for the
> > > multifd/postcopy recovery worlds where one failed connection might not
> > > be fatal; but I don't see how to do that here.
> > 
> > This doesn't feel right to me.
> > 
> > We have passed in a pre-opened FD to QEMU, and we registered a watch
> > on it to detect when there is data from the src QEMU that is available
> > to read.  Normally the src will have sent something so we'll get G_IO_IN,
> > but you're suggesting the client has quit immediately, so we're getting
> > G_IO_HUP due to end of file.
> > 
> > The migration_channel_process_incoming() method that we pass the ioc
> > object to will be calling qio_channel_read(ioc) somewhere to try to
> > read that data.
> > 
> > For QEMU to spin in infinite loop there must be code in the
> > migration_channel_process_incoming() that is ignoring the return
> > value of qio_channel_read() in some manner causing it to retry
> > the read again & again I presume.
> > 
> > Putting this check for G_IO_HUP fixes your immediate problem scenario,
> > but whatever code was spinning in infinite loop is still broken and
> > I'd guess it was possible to still trigger the loop. eg by writing
> > 1 single byte and then closing the socket.
> > 
> > So, IMHO this fix is wrong - we need to find the root cause and fix
> > that, not try to avoid calling the buggy code.
> 
> I agree. AFAIU the first read should be in qemu_loadvm_state():
> 
>     v = qemu_get_be32(f);
>     if (v != QEMU_VM_FILE_MAGIC) {
>         error_report("Not a migration stream");
>         return -EINVAL;
>     }
> 
> So I would be curious more about how that infinite loop happened.

Ah, wait.  I just noticed that Xin mentioned about the loop already -
it's an infinite loop of SIGHUP.  I suppose it means that we'll just
never go into fd_accept_incoming_migration() at all?

If so, I'm not sure whether we should just always watch on G_IO_HUP
(and possibly G_IO_ERR too) in qio_channel_create_watch():

  GSource *ret = klass->io_create_watch(ioc, condition | G_IO_HUP | G_IO_ERR);

Otherwise I'm not sure the same loop will happen for other users of
qio_channel_add_watch().

Regards,
-- 
Peter Xu

  reply	other threads:[~2018-04-25  3:31 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-21  7:22 [Qemu-devel] [PATCH] migration/fd: abort migration if receive POLLHUP event Wang Xin
2018-04-24 17:16 ` Dr. David Alan Gilbert
2018-04-24 18:24   ` Daniel P. Berrangé
2018-04-25  3:14     ` Peter Xu
2018-04-25  3:31       ` Peter Xu [this message]
2018-04-25  7:29         ` wangxin (U)
2018-04-25  8:03           ` Daniel P. Berrangé

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180425033131.GI9036@xz-mi \
    --to=peterx@redhat.com \
    --cc=arei.gonglei@huawei.com \
    --cc=berrange@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=wangxinxin.wang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.