From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ian Campbell
Subject: Re: [PATCH v4 2/5] remus: resume immediately if
 libxl__xc_domain_save_done() completes
Date: Tue, 19 Jan 2016 11:01:25 +0000
Message-ID: <1453201285.29930.14.camel@citrix.com>
References: <1453095622-14859-1-git-send-email-wency@cn.fujitsu.com>
 <1453095622-14859-3-git-send-email-wency@cn.fujitsu.com>
 <1453135918.6020.193.camel@citrix.com>
 <569D8ACF.30508@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <569D8ACF.30508@cn.fujitsu.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wen Congyang , xen devel , Andrew Cooper
Cc: Shriram Rajagopalan , Wei Liu , Changlong Xie , Ian Jackson ,
 Yang Hongyang
List-Id: xen-devel@lists.xenproject.org

On Tue, 2016-01-19 at 09:01 +0800, Wen Congyang wrote:
> On 01/19/2016 12:51 AM, Ian Campbell wrote:
> > On Mon, 2016-01-18 at 13:40 +0800, Wen Congyang wrote:
> > > For example: if the secondary host is down, and we fail to send the
> > > data to the secondary host. xc_domain_save() returns 0. So in the
> > > function libxl__xc_domain_save_done(), rc is 0(the helper program
> > > exits normally), and retval is 0(it is xc_domain_save()'s return
> > > value). In such case, we just need to complete the stream.
> >
> > What if the secondary host isn't actually down but just communication
> > has failed for some reason? Won't both primary and secondary start
> > their respective versions of the domain? What are the consequences of
> > that? (Corruption?)
> >
> > I suppose this is a consequence of the lack of STONITH or splitbrain
> > handling within Remus. Are there any plans to address this?
>
> IIRC, Shriram Rajagopalan has some ideas about it(check the external
> heartbeat?).
> There is no way to avoid splitbrain unless we have more than two
> hosts(at least three hosts). If we want to avoid splitbrain, we may need
> to destroy both primary and secondary guests.

I think there's plenty of existing systems for taking care of this side of
fault-tolerance/HA (e.g. linux-ha, Pacemaker, Corosync, etc), we don't need
(or want) to reinvent that particular wheel here.

I think we just need a story on how one would integrate with such a system
in order to say that Remus is properly usable in real world scenarios (i.e.
before we can remove the "proof-of-concept" wording from the man page).

That might just be a documentation exercise, or it might require some hooks
etc adding to (lib)xl in order to allow such integrations, I'm not sure
what's needed.

IIRC Ian expressed a similar sentiment when Remus support was first added
to libxl.

Ian.