From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42038)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1XSPUI-0004wM-6I
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:58:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1XSPUD-0002U8-Ji
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:58:02 -0400
Received: from mx1.redhat.com ([209.132.183.28]:60596)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1XSPUD-0002Tv-Ao
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:57:57 -0400
Date: Fri, 12 Sep 2014 12:57:46 +0100
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Message-ID: <20140912115746.GF2413@work-vm>
References: <1406125538-27992-1-git-send-email-yanghy@cn.fujitsu.com>
	<1406125538-27992-12-git-send-email-yanghy@cn.fujitsu.com>
	<20140801150347.GE2430@work-vm> <541290C5.4010905@cn.fujitsu.com>
	<20140912111722.GD2413@work-vm> <5412DBA4.1060408@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5412DBA4.1060408@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [RFC PATCH 11/17] COLO ctl: implement colo
	checkpoint protocol
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Hongyang Yang <yanghy@cn.fujitsu.com>
Cc: kvm@vger.kernel.org, GuiJianfeng@cn.fujitsu.com, eddie.dong@intel.com, qemu-devel@nongnu.org, mrhines@linux.vnet.ibm.com

* Hongyang Yang (yanghy@cn.fujitsu.com) wrote:
> 
> 
> On 09/12/2014 07:17 PM, Dr. David Alan Gilbert wrote:
> >* Hongyang Yang (yanghy@cn.fujitsu.com) wrote:
> >>
> >>
> >>On 08/01/2014 11:03 PM, Dr. David Alan Gilbert wrote:
> >>>* Yang Hongyang (yanghy@cn.fujitsu.com) wrote:
> >
> ><snip>
> >
> >>>>+static int do_colo_transaction(MigrationState *s, QEMUFile *control,
> >>>>+                               QEMUFile *trans)
> >>>>+{
> >>>>+    int ret;
> >>>>+
> >>>>+    ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
> >>>>+    if (ret) {
> >>>>+        goto out;
> >>>>+    }
> >>>>+
> >>>>+    ret = colo_ctl_get(control, COLO_CHECKPOINT_SUSPENDED);
> >>>
> >>>What happens at this point if the slave just doesn't respond?
> >>>(i.e. the socket doesn't drop - you just don't get the byte).
> >>
> >>If the socket returns bytes that were not expected, exit. If the
> >>socket returns an error, do some cleanup and quit the COLO process.
> >>Refer to colo_ctl_get() and colo_ctl_get_value().
> >
> >But what happens if the slave just doesn't respond at all; e.g.
> >if the slave host loses power, it'll take a while (many seconds)
> >before the socket will timeout.
> 
> It will wait until the call returns a timeout error, and then do some
> cleanup and quit the COLO process.

If it were to wait here for ~30 seconds for the timeout, what would happen
to the primary? Would it be stopped from sending any network traffic
for those 30 seconds? I think that's too long to wait before failing over.

> Is there a better way to handle this?

In postcopy I always take reads coming back from the destination
in a separate thread, because that thread can't block the main thread
going out (I originally did that using async reads, but the thread
is nicer).  You could also use something like poll() with a shorter
timeout, set to however long you are happy for COLO to wait before it
fails over.
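
As a rough illustration of the poll() idea, here is a minimal sketch that
waits a bounded time for a single reply byte on a plain file descriptor.
It is not taken from the COLO patches: read_byte_with_timeout() and the
WAIT_* codes are made-up names, and it assumes you can get at the raw fd
behind the control channel, which the QEMUFile code may not make easy.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not part of QEMU. */
enum { WAIT_OK = 0, WAIT_TIMEOUT = -1, WAIT_ERROR = -2 };

/*
 * Wait up to timeout_ms milliseconds for one byte to arrive on fd.
 * Returns WAIT_OK and stores the byte in *out, WAIT_TIMEOUT if nothing
 * arrived in time, or WAIT_ERROR on poll/read failure or peer close.
 */
static int read_byte_with_timeout(int fd, int timeout_ms, uint8_t *out)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    ssize_t n;
    int rc;

    assert(out != NULL);

    /* Retry if a signal interrupts us. Note this restarts the full
     * timeout; a real implementation would track the remaining time. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc == 0) {
        return WAIT_TIMEOUT;    /* peer silent: caller can fail over */
    }
    if (rc < 0) {
        return WAIT_ERROR;
    }

    /* Readable, hung up, or errored: let read() report which. */
    n = read(fd, out, 1);
    return n == 1 ? WAIT_OK : WAIT_ERROR;
}
```

The caller would treat WAIT_TIMEOUT the same way as a socket error, i.e.
clean up and leave COLO mode, but after a timeout it chose itself rather
than the TCP one.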

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK