From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60379)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <yanghy@cn.fujitsu.com>) id 1X2XvA-0000KI-LX
	for qemu-devel@nongnu.org; Wed, 02 Jul 2014 23:42:56 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <yanghy@cn.fujitsu.com>) id 1X2Xv2-0001Yk-Nz
	for qemu-devel@nongnu.org; Wed, 02 Jul 2014 23:42:52 -0400
Received: from [59.151.112.132] (port=61219 helo=heian.cn.fujitsu.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <yanghy@cn.fujitsu.com>) id 1X2Xv2-0001Y5-2s
	for qemu-devel@nongnu.org; Wed, 02 Jul 2014 23:42:44 -0400
Message-ID: <53B4D133.4060903@cn.fujitsu.com>
Date: Thu, 3 Jul 2014 11:42:43 +0800
From: Hongyang Yang <yanghy@cn.fujitsu.com>
MIME-Version: 1.0
References: <53A8DD80.7070905@cn.fujitsu.com> <20140701121248.GH2394@work-vm>
In-Reply-To: <20140701121248.GH2394@work-vm>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC] COLO HA Project proposal
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: FNST-Gui Jianfeng <GuiJianfeng@cn.fujitsu.com>, Dong Eddie <eddie.dong@intel.com>, qemu-devel@nongnu.org, kvm@vger.kernel.org

Hi David,

On 07/01/2014 08:12 PM, Dr. David Alan Gilbert wrote:
> * Hongyang Yang (yanghy@cn.fujitsu.com) wrote:
>
> Hi Yang,
>
>> Background:
>>    COLO HA project is a high availability solution. Both primary
>> VM (PVM) and secondary VM (SVM) run in parallel. They receive the
>> same request from client, and generate response in parallel too.
>> If the response packets from PVM and SVM are identical, they are
>> released immediately. Otherwise, a VM checkpoint (on demand) is
>> conducted. The idea was presented at Xen Summit 2012 and 2013,
>> and in an academic paper at SOCC 2013. It was also presented at
>> KVM Forum 2013:
>> http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf
>> Please refer to above document for detailed information.
>
> Yes, I remember that talk - very interesting.
>
> I didn't quite understand a couple of things though, perhaps you
> can explain:
>    1) If we ignore the TCP sequence number problem, in an SMP machine
> don't we get other randomnesses - e.g. which core completes something
> first, or who wins a lock contention, so the output stream might not
> be identical - so do those normal bits of randomness cause the machines
> to flag as out-of-sync?

That question concerns the COLO agent. I'm CCing Congyang, who can give
a detailed explanation.

>
>    2) If the PVM has decided that the SVM is out of sync (due to 1) and
> the PVM fails at about the same point - can we switch over to the SVM?

Yes, we can switch over. We have two mechanisms to ensure the SVM's state
is consistent:
- Memory cache
   The memory cache initially holds a copy of the PVM's memory. At a
checkpoint, we cache the PVM's dirty memory as it is transferred, and
write the cached memory to the SVM only once all of the PVM's memory has
been received (we only need to write memory that was dirtied on the PVM
and on the SVM since the last checkpoint). This solves problem 2) that
you mentioned above: if the PVM fails while checkpointing, the SVM
discards the cached memory and continues to run and provide service
from its current state.

- COLO Disk Manager
   Like the memory cache, the COLO Disk Manager caches the PVM's disk
modifications and writes them to the SVM's disk at a checkpoint. If the
PVM fails while checkpointing, the SVM discards the cached disk
modifications.
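
To make the cache-then-commit idea concrete, here is a minimal user-space
sketch in Python. It is only an illustration, not the QEMU code: the class
and method names are hypothetical, pages are modelled as dictionary
entries, and it assumes that the pages to write back at commit are those
dirtied on either side since the last checkpoint. The disk manager could
follow the same pattern with sectors instead of pages.

```python
class SecondaryMemory:
    """Toy model of the SVM-side memory cache (hypothetical names)."""

    def __init__(self, initial_pages):
        self.svm = dict(initial_pages)    # memory the SVM actually runs on
        self.cache = dict(initial_pages)  # copy of PVM memory at last checkpoint
        self.pending = {}                 # dirty PVM pages of the in-flight checkpoint
        self.svm_dirty = set()            # pages the SVM itself has written

    def guest_write(self, page, data):
        # The SVM keeps running and dirties its own memory.
        self.svm[page] = data
        self.svm_dirty.add(page)

    def receive_dirty_page(self, page, data):
        # Stage incoming PVM pages; SVM memory is not touched yet.
        self.pending[page] = data

    def commit_checkpoint(self):
        # Called only after ALL dirty PVM pages have arrived.
        self.cache.update(self.pending)
        # Assumption: sync pages dirtied on the PVM or on the SVM.
        for page in set(self.pending) | self.svm_dirty:
            self.svm[page] = self.cache[page]
        self.pending.clear()
        self.svm_dirty.clear()

    def discard_checkpoint(self):
        # PVM failed mid-checkpoint: drop the partial data and let the
        # SVM continue from its current state.
        self.pending.clear()
```

Because nothing is written into the running SVM until the whole
checkpoint has arrived, a PVM failure in the middle of a transfer never
leaves the SVM with a half-applied state.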

>
> I'm worried that due to (1) there are periods where the system
> is out-of-sync and a failure of the PVM is not protected.  Does that happen?
> If so how often?
>
>> The attached was the architecture of kvm-COLO we proposed.
>>    - COLO Manager: Requires modifications of qemu
>>      - COLO Controller
>>          COLO Controller includes modifications of save/restore
>>        flow just like MC (microcheckpointing), a memory cache on
>>        the secondary VM which caches the dirty pages of the primary VM,
>>        and a failover module which provides APIs to communicate
>>        with an external heartbeat module.
>>      - COLO Disk Manager
>>          When the PVM writes data to its image, the COLO disk manager
>>        captures this data and sends it to the COLO disk manager on
>>        the secondary side, which makes sure the content of the SVM's
>>        image is consistent with the content of the PVM's image.
>
> I wonder if there is any way to coordinate this between COLO, Michael
> Hines' microcheckpointing and the two separate reverse-execution
> projects that also need to do some similar things.
> Are there any standard APIs for the heartbeat thing we can already
> tie into?

Unfortunately, we checked MC and it does not have heartbeat support at
the moment.

>
>>    - COLO Agent("Proxy module" in the arch picture)
>>        We need an agent to compare the packets returned by
>>      Primary VM and Secondary VM, and decide whether to start a
>>      checkpoint according to some rules. It is a linux kernel
>>      module for host.
>
> Why is that a kernel module, and how does it communicate the state
> to the QEMU instance?

We made this a kernel module for better performance, and because packets
are easy to hook from a kernel module.
The QEMU instance communicates with the COLO Agent through ioctl().
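
To illustrate the comparison rule the agent applies (identical responses
are released, a mismatch triggers a checkpoint), here is a small
user-space sketch in Python. The real agent is a kernel module; the
function name and the use of byte strings as packets are assumptions made
only for this example, and the ioctl() interface is not modelled.

```python
from typing import List, Sequence, Tuple


def compare_responses(pvm_packets: Sequence[bytes],
                      svm_packets: Sequence[bytes]) -> Tuple[List[bytes], bool]:
    """Compare PVM and SVM output packets pairwise, in order.

    Returns (released, need_checkpoint): the packets that matched and can
    be sent to the client, and whether a mismatch was seen that should
    trigger an on-demand checkpoint.
    """
    released = []
    for pvm_pkt, svm_pkt in zip(pvm_packets, svm_packets):
        if pvm_pkt != svm_pkt:
            # Outputs diverged: stop releasing and ask for a checkpoint.
            return released, True
        released.append(pvm_pkt)
    return released, False
```

For example, compare_responses([b"a", b"b"], [b"a", b"c"]) releases only
the first packet and reports that a checkpoint is needed.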

>
>>    - Other minor modifications
>>        We may need other modifications for better performance.
>
> Dave
> P.S. I'm starting to look at fault-tolerance stuff, but haven't
> got very far yet, so starting to try and understand the details
> of COLO, microcheckpointing, etc
>
>> --
>> Thanks,
>> Yang.
>
>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>

-- 
Thanks,
Yang.