From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51702)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <yanghy@cn.fujitsu.com>) id 1XSFlM-0002mr-2o
	for qemu-devel@nongnu.org; Thu, 11 Sep 2014 21:35:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <yanghy@cn.fujitsu.com>) id 1XSFlF-0000DM-Sm
	for qemu-devel@nongnu.org; Thu, 11 Sep 2014 21:35:00 -0400
Received: from [59.151.112.132] (port=18264 helo=heian.cn.fujitsu.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <yanghy@cn.fujitsu.com>) id 1XSFlF-0000DB-1L
	for qemu-devel@nongnu.org; Thu, 11 Sep 2014 21:34:53 -0400
Message-ID: <54124DAF.4090700@cn.fujitsu.com>
Date: Fri, 12 Sep 2014 09:34:39 +0800
From: Hongyang Yang <yanghy@cn.fujitsu.com>
MIME-Version: 1.0
References: <53D8FF52.9000104@gmail.com> <1406820870.2680.3.camel@usa>
	<53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
	<53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
	<53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
	<53E8FBBD.7050703@gmail.com>
	<53E92470.60806@linux.vnet.ibm.com> <53F07B73.60407@redhat.com>
	<54107187.8040706@gmail.com> <5410FFE2.40308@linux.vnet.ibm.com>
In-Reply-To: <5410FFE2.40308@linux.vnet.ibm.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State
	consistency
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>, Walid Nouri <walid.nouri@gmail.com>, Paolo Bonzini <pbonzini@redhat.com>, qemu-devel@nongnu.org, michael@hinespot.com, hinesmr@cn.ibm.com, "Dr.
	David Alan Gilbert" <dgilbert@redhat.com>, Dong Eddie <eddie.dong@intel.com>, FNST-Gui Jianfeng <GuiJianfeng@cn.fujitsu.com>, wency@cn.fujitsu.com


On 09/11/2014 09:50 AM, Michael R. Hines wrote:
> On 09/10/2014 11:43 PM, Walid Nouri wrote:
>> Hello Michael, Hello Paolo
>> I have „studied“ the available documentation/information and tried to get an
>> idea of the QEMU live block operation possibilities.
>>
>> I think the MC protocol doesn’t need synchronous block device replication
>> because primary and secondary VM are not synchronous. The state of the primary
>> is always ahead of the state of the secondary. When the primary is in epoch(n)
>> the secondary is in epoch(n-1).
>>
>> What MC needs is a block device agnostic, controlled and asynchronous approach
>> for replicating the contents of block devices and their state changes to the
>> secondary VM while the primary VM is running. Asynchronous block transfer is
>> important to allow maximum performance for the primary VM, while keeping the
>> secondary VM updated with state changes.
>>
>> The block device replication should be possible in two stages or modes.
>>
>> The first stage is the live copy of all block devices of the primary to the
>> secondary. This is necessary if the secondary doesn’t have an existing image
>> which is in sync with the primary at the time MC is started. This is not very
>> convenient, but as far as I know there is currently no mechanism for
>> persistent dirty bitmaps in QEMU.
>>
>> The second stage (mode) is the replication of block device state changes
>> (modified blocks) to keep the image on the secondary in sync with the primary.
>> The mirrored blocks must be buffered in RAM (block buffer) until the complete
>> checkpoint (RAM, vCPU, device state) can be committed.
>>
>> To keep the complete system state consistent on the secondary system, there
>> must be a way for MC to commit/discard block device state changes. In
>> normal operation the mirrored block device state changes (block buffer) are
>> committed to disk when the complete checkpoint is committed. In case of a
>> crash of the primary system while transferring a checkpoint, the data in the
>> block buffer corresponding to the failed checkpoint must be discarded.
>>
>> The storage architecture should be “shared nothing” so that no shared storage
>> is required and primary/secondary can have separate block device images.
>>
>> I think this can be achieved by drive-mirror and a filter block driver.
>> Another approach could be to exploit the block migration functionality of live
>> migration with a filter block driver.
>>
>> drive-mirror (like live migration) does not rely on shared storage and
>> allows live block device copy and incremental syncing.
>>
>> A block buffer can be implemented with a QEMU filter block driver. It should
>> sit at the same position as the Quorum driver in the block driver hierarchy.
>> With the block filter approach, MC will be transparent and block device
>> agnostic.
>>
>> The block buffer filter must have an interface which allows MC to control the
>> commits or discards of block device state changes. I have no idea where to put
>> such an interface so that it conforms to QEMU coding style.
>>
>>
>> I’m sure there are alternative and better approaches and I’m open to any ideas
>>
>>
>> Walid
>>
>> Am 17.08.2014 11:52, schrieb Paolo Bonzini:
>>> Il 11/08/2014 22:15, Michael R. Hines ha scritto:
>>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>>> sure what the
>>>> adoption rate of the feature is, but I would start with that one.
>>>
>>> block/mirror.c is asynchronous, and there's no support for communicating
>>> checkpoints back to the master. However, the quorum disk driver could
>>> be what you need.
>>>
>>> There's also a series on the mailing list that lets quorum read only
>>> from the primary, so that quorum can still do replication and fault
>>> tolerance, but skip fault detection.
>>>
>>> Paolo
>>>
>>>> There is also a second fault tolerance implementation that works a
>>>> little differently called
>>>> "COLO" - you may have seen those emails on the list too, but their
>>>> method does not require a disk replication solution, if I recall correctly.
>>>
>>
>
> Nice description of the problem - would you like to put this information on the
> MC wiki page? (Just send an email to the list that says "request for wiki
> account, please" in the subject - and they will make an account for you.)
>
> A drive-mirror + filter driver solution sounds like a good plan overall,
> of course the devil is in the details =)

If I understand correctly, this is similar to our approach. Disk replication
is definitely in our plan, and we will post RFC patches that include disk
replication. You can keep an eye on the COLO patches :)
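
To make the commit/discard semantics described above a bit more concrete,
here is a rough sketch in Python (for brevity only). It is not COLO, MC or
QEMU code: the names EpochBlockBuffer, buffer_write, commit and discard are
made up for illustration, and a plain dict stands in for the secondary's
disk image. It only shows the idea that mirrored writes are held per
checkpoint epoch and reach the disk only when that epoch is committed.

```python
class EpochBlockBuffer:
    """Hypothetical secondary-side buffer for mirrored block writes."""

    def __init__(self, disk):
        # disk: dict mapping sector number -> bytes (stand-in for the image)
        self.disk = disk
        # pending: epoch number -> {sector: data} not yet applied to disk
        self.pending = {}

    def buffer_write(self, epoch, sector, data):
        # Hold the write in RAM; a later write to the same sector within
        # the same epoch simply replaces the earlier one.
        self.pending.setdefault(epoch, {})[sector] = data

    def commit(self, epoch):
        # The checkpoint for this epoch arrived completely: apply its writes.
        writes = self.pending.pop(epoch, {})
        self.disk.update(writes)
        return len(writes)

    def discard(self, epoch):
        # The primary failed mid-checkpoint: drop the writes for this epoch.
        return len(self.pending.pop(epoch, {}))


disk = {0: b"old"}
buf = EpochBlockBuffer(disk)

buf.buffer_write(1, 0, b"new")
buf.buffer_write(1, 7, b"x")
buf.commit(1)      # epoch 1 checkpoint completed, writes reach the disk

buf.buffer_write(2, 0, b"lost")
buf.discard(2)     # epoch 2 checkpoint failed, disk stays at epoch 1 state
```

A real filter driver would of course also have to serve reads on the
secondary and bound the memory used by pending epochs; the sketch leaves
both out.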

>
> I don't know how much time you have to spend on actual code, but even a
> description of what a "theoretical" interface between MC and drive-mirror would
> look like would go a long way even without code.
>
> Your investigations would also help "drive" a solution to this problem for the
> COLO team as well - I believe they need the same thing....
>
> - Michael
>
> .
>

-- 
Thanks,
Yang.