From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42038)
	by lists.gnu.org with esmtp (Exim 4.71) id 1XSPUI-0004wM-6I
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:58:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	id 1XSPUD-0002U8-Ji
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:58:02 -0400
Received: from mx1.redhat.com ([209.132.183.28]:60596)
	by eggs.gnu.org with esmtp (Exim 4.71) id 1XSPUD-0002Tv-Ao
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:57:57 -0400
Date: Fri, 12 Sep 2014 12:57:46 +0100
From: "Dr. David Alan Gilbert"
Message-ID: <20140912115746.GF2413@work-vm>
References: <1406125538-27992-1-git-send-email-yanghy@cn.fujitsu.com>
	<1406125538-27992-12-git-send-email-yanghy@cn.fujitsu.com>
	<20140801150347.GE2430@work-vm>
	<541290C5.4010905@cn.fujitsu.com>
	<20140912111722.GD2413@work-vm>
	<5412DBA4.1060408@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5412DBA4.1060408@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [RFC PATCH 11/17] COLO ctl: implement colo checkpoint protocol
To: Hongyang Yang
Cc: kvm@vger.kernel.org, GuiJianfeng@cn.fujitsu.com, eddie.dong@intel.com,
	qemu-devel@nongnu.org, mrhines@linux.vnet.ibm.com

* Hongyang Yang (yanghy@cn.fujitsu.com) wrote:
>
>
> On 09/12/2014 07:17 PM, Dr. David Alan Gilbert wrote:
> >* Hongyang Yang (yanghy@cn.fujitsu.com) wrote:
> >>
> >>
> >>On 08/01/2014 11:03 PM, Dr. David Alan Gilbert wrote:
> >>>* Yang Hongyang (yanghy@cn.fujitsu.com) wrote:
> >
> >
> >
> >>>>+static int do_colo_transaction(MigrationState *s, QEMUFile *control,
> >>>>+                               QEMUFile *trans)
> >>>>+{
> >>>>+    int ret;
> >>>>+
> >>>>+    ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
> >>>>+    if (ret) {
> >>>>+        goto out;
> >>>>+    }
> >>>>+
> >>>>+    ret = colo_ctl_get(control, COLO_CHECKPOINT_SUSPENDED);
> >>>
> >>>What happens at this point if the slave just doesn't respond?
> >>>(i.e. the socket doesn't drop - you just don't get the byte).
> >>
> >>If the socket return bytes that were not expected, exit. If
> >>socket return error, do some cleanup and quit COLO process.
> >>refer to: colo_ctl_get() and colo_ctl_get_value()
> >
> >But what happens if the slave just doesn't respond at all; e.g.
> >if the slave host loses power, it'll take a while (many seconds)
> >before the socket will timeout.
>
> It will wait until the call returns timeout error, and then do some
> cleanup and quit COLO process.

If it was to wait here for ~30 seconds for the timeout what would happen
to the primary? Would it be stopped from sending any network traffic
for those 30 seconds - I think that's too long to fail over.

> There may be better way to handle this?

In postcopy I always take reads coming back from the destination
in a separate thread, because that thread can't block the main thread
going out (I originally did that using async reads but the thread is nicer).
You could also use something like a poll() with a shorter timeout
to however long you are happy for COLO to go before it fails.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
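
To make the poll() suggestion above concrete, here is a rough sketch of
waiting for a single control byte with a bounded timeout. It is not code
from the COLO series: ctl_wait_byte(), the CTL_WAIT_* result codes and
the use of a raw socket fd (rather than a QEMUFile) are all assumptions
made up for illustration, and the timeout value would be whatever COLO
is prepared to tolerate before failing over.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result codes; these names are not from the COLO patch. */
enum ctl_wait_result {
    CTL_WAIT_OK = 0,        /* the expected byte arrived in time */
    CTL_WAIT_TIMEOUT = -1,  /* peer stayed silent for timeout_ms */
    CTL_WAIT_ERROR = -2     /* socket error, hangup or unexpected byte */
};

/*
 * Wait at most timeout_ms milliseconds for one control byte on fd and
 * compare it with 'expected'. A silent peer (e.g. a host that lost
 * power) yields CTL_WAIT_TIMEOUT instead of blocking until the TCP
 * timeout fires.
 */
static int ctl_wait_byte(int fd, uint8_t expected, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    uint8_t got;
    ssize_t n;
    int rc;

    /* Simplification: a signal restarts the wait with the full timeout. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc == 0) {
        return CTL_WAIT_TIMEOUT;
    }
    if (rc < 0 || !(pfd.revents & POLLIN)) {
        /* poll() failed, or only POLLERR/POLLHUP/POLLNVAL was reported. */
        return CTL_WAIT_ERROR;
    }

    do {
        n = recv(fd, &got, 1, 0);
    } while (n < 0 && errno == EINTR);

    if (n != 1) {
        /* n == 0 means the peer closed the connection. */
        return CTL_WAIT_ERROR;
    }
    return got == expected ? CTL_WAIT_OK : CTL_WAIT_ERROR;
}
```

A caller would treat CTL_WAIT_TIMEOUT as the trigger for failover rather
than as an ordinary socket error, so the primary is only held up for
timeout_ms and not for however long the kernel takes to give up on the
connection.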