Date: Mon, 4 Jan 2016 15:51:26 +0000
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Message-ID: <20160104155126.GH2529@work-vm>
In-Reply-To: <20151223092603.GA11394@stefanha-x1.localdomain>
References: <1449034311-4094-1-git-send-email-wency@cn.fujitsu.com>
 <1449034311-4094-6-git-send-email-wency@cn.fujitsu.com>
 <20151223092603.GA11394@stefanha-x1.localdomain>
Subject: Re: [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf, Fam Zheng, qemu block, Jiang Yunhong, Dong Eddie,
 qemu devel, "Michael R. Hines", Max Reitz, Gonglei, Paolo Bonzini,
 zhanghailiang

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
> > +== Failure Handling ==
> > +There are 6 internal errors when block replication is running:
> > +1. I/O error on primary disk
> > +2. Forwarding primary write requests failed
> > +3. Backup failed
> > +4. I/O error on secondary disk
> > +5. I/O error on active disk
> > +6. Making active disk or hidden disk empty failed
> > +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
> > +4 and 6, we just report block replication's error to FT/HA manager (which
> > +decides when to do a new checkpoint, when to do failover).
> > +There is no internal error when doing failover.
>
> Not sure this is true.
>
> Below it says the following for failover: "We will flush the Disk buffer
> into Secondary Disk and stop block replication". Flushing the disk
> buffer can result in I/O errors. This means that failover operations
> are not guaranteed to succeed.
>
> In practice I think this is similar to a successful failover followed by
> immediately getting I/O errors on the new Primary Disk. It means that
> right after failover there is another failure and the system may not be
> able to continue.

Yes, I think that's true.

> So this really only matters in the case where there is a new Secondary
> ready after failover. In that case the user might expect failover to
> continue to the new Secondary (Host 3):
>
>   [X]        [X]
> Host 1 <-> Host 2 <-> Host 3

Since COLO is only doing 1+1 redundancy, I think it's not expected to
cope with a double host failure; it's going to take some time (seconds?)
to sync Host 3 back in when you add it after a failover, and the aim
would be not to disturb the application for that long, so it should
already be running on Host 2 during that resync.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
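
[To make the reporting split in the quoted Failure Handling section
concrete, here is a minimal C sketch. The enum values and the two
report_* helpers are hypothetical stand-ins for illustration, not
QEMU's actual block-replication API.]

#include <stdio.h>

/* The six internal errors listed in the quoted Failure Handling text. */
typedef enum {
    REP_ERR_PRIMARY_IO = 1,   /* 1. I/O error on primary disk            */
    REP_ERR_FORWARD_WRITE,    /* 2. forwarding primary writes failed     */
    REP_ERR_BACKUP,           /* 3. backup failed                        */
    REP_ERR_SECONDARY_IO,     /* 4. I/O error on secondary disk          */
    REP_ERR_ACTIVE_IO,        /* 5. I/O error on active disk             */
    REP_ERR_EMPTY_FAILED,     /* 6. emptying active/hidden disk failed   */
} ReplicationError;

/* Hypothetical stand-in: surface the error to the guest's disk layer,
 * exactly like an ordinary I/O error on a non-replicated disk. */
static void report_to_disk_layer(ReplicationError err)
{
    fprintf(stderr, "disk layer: I/O error (case %d)\n", err);
}

/* Hypothetical stand-in: the FT/HA manager decides whether to take a
 * new checkpoint or to fail over. */
static void report_to_ft_ha_manager(ReplicationError err)
{
    fprintf(stderr, "FT/HA manager: replication error (case %d)\n", err);
}

static void report_replication_error(ReplicationError err)
{
    switch (err) {
    case REP_ERR_PRIMARY_IO:
    case REP_ERR_ACTIVE_IO:
        /* Cases 1 and 5 go to the disk layer. */
        report_to_disk_layer(err);
        break;
    default:
        /* Cases 2, 3, 4 and 6 go to the FT/HA manager. */
        report_to_ft_ha_manager(err);
        break;
    }
}

[The point of the split: cases 1 and 5 look to the guest like ordinary
disk errors, while the other four are only meaningful to whatever is
coordinating checkpoints and failover.]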
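
[Stefan's objection to "there is no internal error when doing failover"
can be sketched the same way. flush_disk_buffer_to_secondary() and
stop_block_replication() are hypothetical names for the two failover
steps the document describes, not real QEMU functions.]

/* Hypothetical stand-ins for the two failover steps; the flush is
 * assumed to return < 0 on an I/O error. */
static int flush_disk_buffer_to_secondary(void) { return 0; }
static void stop_block_replication(void) { }

/* Failover is not error-free: flushing the Disk buffer into the
 * Secondary Disk can itself hit an I/O error. */
static int do_failover(void)
{
    int ret = flush_disk_buffer_to_secondary();
    if (ret < 0) {
        /* Comparable to a successful failover followed immediately by
         * an I/O error on the new Primary Disk: the system may not be
         * able to continue. */
        return ret;
    }
    stop_block_replication();
    return 0;
}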