From: Peter Xu <peterx@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Wang Xin <wangxinxin.wang@huawei.com>,
qemu-devel@nongnu.org, quintela@redhat.com,
arei.gonglei@huawei.com
Subject: Re: [Qemu-devel] [PATCH] migration/fd: abort migration if receive POLLHUP event
Date: Wed, 25 Apr 2018 11:31:31 +0800 [thread overview]
Message-ID: <20180425033131.GI9036@xz-mi> (raw)
In-Reply-To: <20180425031423.GH9036@xz-mi>
On Wed, Apr 25, 2018 at 11:14:23AM +0800, Peter Xu wrote:
> On Tue, Apr 24, 2018 at 07:24:05PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 24, 2018 at 06:16:31PM +0100, Dr. David Alan Gilbert wrote:
> > > * Wang Xin (wangxinxin.wang@huawei.com) wrote:
> > > > If the fd socket peer closed shortly, ppoll may receive a POLLHUP
> > > > event before the expected POLLIN event, and qemu will do nothing
> > > > but goes into an infinite loop of the POLLHUP event.
> > > >
> > > > So, abort the migration if we receive a POLLHUP event.
> > >
> > > Hi Wang Xin,
> > > Can you explain how you manage to trigger this case; I've not hit it.
> > >
> > > > Signed-off-by: Wang Xin <wangxinxin.wang@huawei.com>
> > > >
> > > > diff --git a/migration/fd.c b/migration/fd.c
> > > > index cd06182..5932c87 100644
> > > > --- a/migration/fd.c
> > > > +++ b/migration/fd.c
> > > > @@ -15,6 +15,7 @@
> > > > */
> > > >
> > > > #include "qemu/osdep.h"
> > > > +#include "qemu/error-report.h"
> > > > #include "channel.h"
> > > > #include "fd.h"
> > > > #include "monitor/monitor.h"
> > > > @@ -46,6 +47,11 @@ static gboolean fd_accept_incoming_migration(QIOChannel *ioc,
> > > > GIOCondition condition,
> > > > gpointer opaque)
> > > > {
> > > > + if (condition & G_IO_HUP) {
> > > > + error_report("The migration peer closed, job abort");
> > > > + exit(EXIT_FAILURE);
> > > > + }
> > > > +
> > >
> > > OK, I wish we had a nicer way for failing; especially for the
> > > multifd/postcopy recovery worlds where one failed connection might not
> > > be fatal; but I don't see how to do that here.
> >
> > This doesn't feel right to me.
> >
> > We have passed in a pre-opened FD to QEMU, and we registered a watch
> > on it to detect when there is data from the src QEMU that is available
> > to read. Normally the src will have sent something so we'll get G_IO_IN,
> > but you're suggesting the client has quit immediately, so we're getting
> > G_IO_HUP due to end of file.
> >
> > The migration_channel_process_incoming() method that we pass the ioc
> > object to will be calling qio_channel_read(ioc) somewhere to try to
> > read that data.
> >
> > For QEMU to spin in infinite loop there must be code in the
> > migration_channel_process_incoming() that is ignoring the return
> > value of qio_channel_read() in some manner causing it to retry
> > the read again & again I presume.
> >
> > Putting this check for G_IO_HUP fixes your immediate problem scenario,
> > but whatever code was spinning in infinite loop is still broken and
> > I'd guess it was possible to still trigger the loop. eg by writing
> > 1 single byte and then closing the socket.
> >
> > So, IMHO this fix is wrong - we need to find the root cause and fix
> > that, not try to avoid calling the buggy code.
>
> I agree. AFAIU the first read should be in qemu_loadvm_state():
>
> v = qemu_get_be32(f);
> if (v != QEMU_VM_FILE_MAGIC) {
> error_report("Not a migration stream");
> return -EINVAL;
> }
>
> So I would be curious more about how that infinite loop happened.
Ah, wait. I just noticed that Xin mentioned about the loop already -
it's an infinite loop of SIGHUP. I suppose it means that we'll just
never go into fd_accept_incoming_migration() at all?
If so, I'm not sure whether we should just always watch on G_IO_HUP
(and possibly G_IO_ERR too) in qio_channel_create_watch():
GSource *ret = klass->io_create_watch(ioc, condition | G_IO_HUP | G_IO_ERR);
Otherwise I'm not sure the same loop will happen for other users of
qio_channel_add_watch().
Regards,
--
Peter Xu
next prev parent reply other threads:[~2018-04-25 3:31 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-21 7:22 [Qemu-devel] [PATCH] migration/fd: abort migration if receive POLLHUP event Wang Xin
2018-04-24 17:16 ` Dr. David Alan Gilbert
2018-04-24 18:24 ` Daniel P. Berrangé
2018-04-25 3:14 ` Peter Xu
2018-04-25 3:31 ` Peter Xu [this message]
2018-04-25 7:29 ` wangxin (U)
2018-04-25 8:03 ` Daniel P. Berrangé
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180425033131.GI9036@xz-mi \
--to=peterx@redhat.com \
--cc=arei.gonglei@huawei.com \
--cc=berrange@redhat.com \
--cc=dgilbert@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=wangxinxin.wang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).