From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Dr. David Alan Gilbert"
Subject: Re: [Qemu-devel] [RFC] COLO HA Project proposal
Date: Fri, 4 Jul 2014 09:35:46 +0100
Message-ID: <20140704083546.GC2425@work-vm>
References: <53A8DD80.7070905@cn.fujitsu.com> <20140701121248.GH2394@work-vm> <53B4D133.4060903@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8BIT
Cc: Hongyang Yang, "qemu-devel@nongnu.org", FNST-Gui Jianfeng, "kvm@vger.kernel.org", Wen Congyang
To: "Dong, Eddie"
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:19375 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750999AbaGDIf7 convert rfc822-to-8bit (ORCPT );
	Fri, 4 Jul 2014 04:35:59 -0400
Content-Disposition: inline
In-Reply-To:
Sender: kvm-owner@vger.kernel.org
List-ID:

* Dong, Eddie (eddie.dong@intel.com) wrote:
> > >
> > > I didn't quite understand a couple of things though, perhaps you can
> > > explain:
> > > 1) If we ignore the TCP sequence number problem, in an SMP machine
> > > don't we get other randomnesses - e.g. which core completes something
> > > first, or who wins a lock contention, so the output stream might not
> > > be identical - so do those normal bits of randomness cause the
> > > machines to flag as out-of-sync?
> >
> > It's about COLO agent, CCing Congyang, he can give the detailed
> > explanation.
> >
>
> Let me clarify on this issue. COLO didn't ignore the TCP sequence number, but uses a
> new implementation to make the sequence number to be best effort identical
> between the primary VM (PVM) and secondary VM (SVM). Likely, VMM has to synchronize
> the emulation of randomization number generation mechanism between the
> PVM and SVM, like the lock-stepping mechanism does.
>
> Furthermore, for long TCP connection, we can rely on the (on-demand) VM checkpoint to get the
> identical Sequence number both in PVM and SVM.

That wasn't really my question; I was worrying about other forms of
randomness, such as the winners of lock contention and other SMP
non-determinisms. I'm also worried about what proportion of the time
the system can't recover from a failure because it can't distinguish
an SVM failure from a randomness issue.

Dave

>
>
> Thanks, Eddie
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
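
To make the lock-contention concern concrete, here is a minimal sketch in Python. It is not COLO code: `run_replica`, `first_divergence`, and the `lock_winners` list are hypothetical names invented for illustration, and lock contention is modelled as nothing more than a choice between the two oldest pending requests. It only shows that two healthy replicas fed identical requests can emit different output streams, which a packet-by-packet comparison would report as a divergence.

```python
def run_replica(requests, lock_winners):
    """Hypothetical model of one VM handling the same request list.

    lock_winners stands in for SMP timing: each entry (0 or 1) says which
    of the two oldest pending requests wins the lock and is answered first.
    Returns the output packet stream in the order it would be sent.
    """
    pending = list(requests)
    output = []
    for winner in lock_winners:
        if not pending:
            break
        # At most two requests contend; with one left there is no race.
        contenders = min(2, len(pending))
        output.append("reply:" + pending.pop(winner % contenders))
    # Anything left over is answered in arrival order.
    output.extend("reply:" + r for r in pending)
    return output


def first_divergence(pvm_out, svm_out):
    """Index of the first differing packet, or None if the streams match."""
    for i, (p, s) in enumerate(zip(pvm_out, svm_out)):
        if p != s:
            return i
    if len(pvm_out) != len(svm_out):
        return min(len(pvm_out), len(svm_out))
    return None


if __name__ == "__main__":
    requests = ["a", "b", "c"]
    pvm = run_replica(requests, [0, 0, 0])  # vCPU 0 always wins on the PVM
    svm = run_replica(requests, [1, 0, 0])  # vCPU 1 wins the first race on the SVM
    print(pvm, svm, first_divergence(pvm, svm))
```

In this toy model both replicas send the same set of replies, only in a different order, so the comparison alone cannot tell a benign reordering from a genuinely broken SVM.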