Date: Mon, 4 Jan 2016 15:51:26 +0000
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Message-ID: <20160104155126.GH2529@work-vm>
In-Reply-To: <20151223092603.GA11394@stefanha-x1.localdomain>
References: <1449034311-4094-1-git-send-email-wency@cn.fujitsu.com>
 <1449034311-4094-6-git-send-email-wency@cn.fujitsu.com>
 <20151223092603.GA11394@stefanha-x1.localdomain>
Subject: Re: [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf, Fam Zheng, qemu block, Jiang Yunhong, Dong Eddie,
 qemu devel, "Michael R. Hines", Max Reitz, Gonglei, Paolo Bonzini,
 zhanghailiang

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
> > +== Failure Handling ==
> > +There are 6 internal errors when block replication is running:
> > +1. I/O error on primary disk
> > +2. Forwarding primary write requests failed
> > +3. Backup failed
> > +4. I/O error on secondary disk
> > +5. I/O error on active disk
> > +6. Making active disk or hidden disk empty failed
> > +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
> > +4 and 6, we just report block replication's error to FT/HA manager (which
> > +decides when to do a new checkpoint, when to do failover).
> > +There is no internal error when doing failover.
>
> Not sure this is true.
>
> Below it says the following for failover: "We will flush the Disk buffer
> into Secondary Disk and stop block replication". Flushing the disk
> buffer can result in I/O errors. This means that failover operations
> are not guaranteed to succeed.
>
> In practice I think this is similar to a successful failover followed by
> immediately getting I/O errors on the new Primary Disk. It means that
> right after failover there is another failure and the system may not be
> able to continue.

Yes, I think that's true.

> So this really only matters in the case where there is a new Secondary
> ready after failover. In that case the user might expect failover to
> continue to the new Secondary (Host 3):
>
>   [X]        [X]
> Host 1 <-> Host 2 <-> Host 3

Since COLO is only doing 1+1 redundancy, I think it's not expected to
cope with a double host failure; it's going to take some time (seconds?)
to sync Host 3 back in when you add it after a failover, and the aim
would be not to disturb the application for that long, so it should
already be running on Host 2 during that resync.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
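
[To make the reporting split in the quoted Failure Handling section
concrete, here is a minimal C sketch. The enum values and the two
report_* helpers are hypothetical stand-ins for illustration, not
QEMU's actual block-replication API.]

#include <stdio.h>

/* The six internal errors listed in the quoted Failure Handling text. */
typedef enum {
    REP_ERR_PRIMARY_IO = 1,   /* 1. I/O error on primary disk            */
    REP_ERR_FORWARD_WRITE,    /* 2. forwarding primary writes failed     */
    REP_ERR_BACKUP,           /* 3. backup failed                        */
    REP_ERR_SECONDARY_IO,     /* 4. I/O error on secondary disk          */
    REP_ERR_ACTIVE_IO,        /* 5. I/O error on active disk             */
    REP_ERR_EMPTY_FAILED,     /* 6. emptying active/hidden disk failed   */
} ReplicationError;

/* Hypothetical stand-in: surface the error to the guest's disk layer,
 * exactly like an ordinary I/O error on a non-replicated disk. */
static void report_to_disk_layer(ReplicationError err)
{
    fprintf(stderr, "disk layer: I/O error (case %d)\n", err);
}

/* Hypothetical stand-in: the FT/HA manager decides whether to take a
 * new checkpoint or to fail over. */
static void report_to_ft_ha_manager(ReplicationError err)
{
    fprintf(stderr, "FT/HA manager: replication error (case %d)\n", err);
}

static void report_replication_error(ReplicationError err)
{
    switch (err) {
    case REP_ERR_PRIMARY_IO:
    case REP_ERR_ACTIVE_IO:
        /* Cases 1 and 5 go to the disk layer. */
        report_to_disk_layer(err);
        break;
    default:
        /* Cases 2, 3, 4 and 6 go to the FT/HA manager. */
        report_to_ft_ha_manager(err);
        break;
    }
}

[The point of the split: cases 1 and 5 look to the guest like ordinary
disk errors, while the other four are only meaningful to whatever is
coordinating checkpoints and failover.]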
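
[Stefan's objection to "there is no internal error when doing failover"
can be sketched the same way. flush_disk_buffer_to_secondary() and
stop_block_replication() are hypothetical names for the two failover
steps the document describes, not real QEMU functions.]

/* Hypothetical stand-ins for the two failover steps; the flush is
 * assumed to return < 0 on an I/O error. */
static int flush_disk_buffer_to_secondary(void) { return 0; }
static void stop_block_replication(void) { }

/* Failover is not error-free: flushing the Disk buffer into the
 * Secondary Disk can itself hit an I/O error. */
static int do_failover(void)
{
    int ret = flush_disk_buffer_to_secondary();
    if (ret < 0) {
        /* Comparable to a successful failover followed immediately by
         * an I/O error on the new Primary Disk: the system may not be
         * able to continue. */
        return ret;
    }
    stop_block_replication();
    return 0;
}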