Date: Wed, 2 Aug 2017 10:28:47 +0100
From: "Daniel P. Berrange"
To: Peter Xu
Cc: qemu-devel@nongnu.org, Laurent Vivier, Andrea Arcangeli, Juan Quintela,
    Alexey Perevalov, "Dr . David Alan Gilbert"
Subject: Re: [Qemu-devel] [RFC 16/29] qmp: hmp: add migrate "resume" option
Message-ID: <20170802092847.GC4098@redhat.com>
In-Reply-To: <20170802055646.GE15080@pxdev.xzpeter.org>
References: <1501229198-30588-1-git-send-email-peterx@redhat.com>
 <1501229198-30588-17-git-send-email-peterx@redhat.com>
 <20170801110348.GH12670@redhat.com>
 <20170802055646.GE15080@pxdev.xzpeter.org>

On Wed, Aug 02, 2017 at 01:56:46PM +0800, Peter Xu wrote:
> On Tue, Aug 01, 2017 at 12:03:48PM +0100, Daniel P. Berrange wrote:
> > On Fri, Jul 28, 2017 at 04:06:25PM +0800, Peter Xu wrote:
> > > It will be used when we want to resume one paused migration.
> > > 
> > > Signed-off-by: Peter Xu
> > > ---
> > >  hmp-commands.hx       | 7 ++++---
> > >  hmp.c                 | 4 +++-
> > >  migration/migration.c | 2 +-
> > >  qapi-schema.json      | 5 ++++-
> > >  4 files changed, 12 insertions(+), 6 deletions(-)
> > 
> > I'm not seeing explicit info about how we handle the original failure
> > and how it relates to this resume command, but this feels like a
> > potentially racy approach to me.
> > 
> > If we have a network problem between source & target, we could see
> > two results. Either the TCP stream will simply hang (it'll still
> > appear open to QEMU but no traffic will be flowing),
> 
> (let's say this is the "1st condition")
> 
> > or the connection may actually break such that we get EOF and end up
> > closing the file descriptor.
> 
> (let's say this is the "2nd condition")
> 
> > In the latter case, we're ok because the original channel is now
> > gone and we can safely establish the new one by issuing the new
> > 'migrate --resume URI' command.
> > 
> > In the former case, however, there is the possibility that the
> > hang may come back to life at some point, concurrently with us
> > trying to do 'migrate --resume URI', and I'm unclear on the
> > semantics if that happens.
> > 
> > Should the original connection carry on, and thus cause the
> > 'migrate --resume' command to fail, or will we forcibly terminate
> > the original connection no matter what and use the new "resumed"
> > connection?
> 
> Hmm yes, this is a good question. Currently this series only handles
> the 2nd condition, that is, when we can detect the error via system
> calls (IIUC we can know nothing when the 1st condition is
> encountered; we just, e.g., block at the system calls as usual when
> reading the file handle). And currently the "resume" command is only
> allowed if the 2nd condition has been detected (so it will never
> destroy an existing channel).
> 
> If you look at the next patch in the series, there is something like:
> 
>     if (has_resume && resume) {
>         if (s->state != MIGRATION_STATUS_POSTCOPY_PAUSED) {
>             error_setg(errp, "Cannot resume if there is no "
>                        "paused migration");
>             return;
>         }
>         goto do_resume;
>     }
> 
> And here MIGRATION_STATUS_POSTCOPY_PAUSED will only be set when the
> 2nd condition is met.
> 
> > There's also synchronization with the target host - at the time we
> > want to recover, we need to be able to tell the target to accept
> > new incoming clients again, but we don't want to do that if the
> > original connection comes back to life.
> 
> Yeah, I hacked this part in this v1 series (as you may have seen) to
> keep the ports open forever. I am not sure whether that is
> acceptable, but it looks like it is not. :)
> 
> How about this: when the destination detects the 2nd condition, it
> first switches to the "postcopy-pause" state, then re-opens the
> accept channels. It can turn the accept channels off again when the
> state moves out of "postcopy-pause".
> 
> > It feels to me that if the mgmt app or admin believes the migration
> > is in a stuck state, we should be able to explicitly terminate the
> > existing connection via a monitor command. Then set up the target
> > host to accept a new client, and then issue this migrate resume on
> > the source.
> 
> Totally agree. That should be the only way to handle the 1st
> condition well. However, would you mind if I postpone it a bit? IMHO
> as long as we can solve the 2nd condition nicely (which is the goal
> of this series), it won't be too hard to continue to support the 1st
> condition.

Sure, the 1st scenario is an easy bolt-on to the second scenario. I
just wanted to be clear about what the target of these patches is,
because I think the 1st scenario is probably the most common one.

I guess if you have TCP keepalives enabled with a reasonably short
timeout, the 1st scenario will turn into the 2nd scenario fairly
quickly.

Regards,
Daniel

-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org  -o-    https://www.instagram.com/dberrange :|
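
For reference, a minimal sketch of the kind of keepalive tuning Daniel
refers to: plain POSIX/Linux setsockopt() calls on the migration
socket, not code from QEMU or from this series. The helper name and
the timeout values are illustrative assumptions only.

    /* Enable TCP keepalives so a silently dead peer eventually shows
     * up as a socket error, i.e. the "1st condition" (hang) degrades
     * into the "2nd condition" (detectable failure) after roughly
     * idle + interval * count seconds.  The TCP_KEEPIDLE/KEEPINTVL/
     * KEEPCNT options are Linux-specific. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int enable_keepalive(int fd)
    {
        int on = 1;
        int idle = 30;      /* seconds idle before first probe */
        int interval = 5;   /* seconds between probes          */
        int count = 3;      /* unanswered probes before error  */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
                       &on, sizeof(on)) < 0) {
            return -1;
        }
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                       &idle, sizeof(idle)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                       &interval, sizeof(interval)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                       &count, sizeof(count)) < 0) {
            return -1;
        }
        return 0;
    }

Calling something like this on the source's outgoing socket right
after connect() (and on the destination's accepted socket) would make
a dead link surface as a socket error within about a minute with the
values above, at which point the paused-migration handling discussed
in this thread can take over.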