From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57957)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <walid.nouri@gmail.com>) id 1XJere-0002RV-PR
	for qemu-devel@nongnu.org; Tue, 19 Aug 2014 04:34:07 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <walid.nouri@gmail.com>) id 1XJerV-0001rz-7s
	for qemu-devel@nongnu.org; Tue, 19 Aug 2014 04:33:58 -0400
Received: from mail-wg0-x229.google.com ([2a00:1450:400c:c00::229]:56085)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <walid.nouri@gmail.com>) id 1XJerU-0001rP-TU
	for qemu-devel@nongnu.org; Tue, 19 Aug 2014 04:33:49 -0400
Received: by mail-wg0-f41.google.com with SMTP id z12so5983628wgg.0
	for <qemu-devel@nongnu.org>; Tue, 19 Aug 2014 01:33:47 -0700 (PDT)
References: <53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
	<53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
	<53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
	<53E8FBBD.7050703@gmail.com>
	<53E9247F.4030909@linux.vnet.ibm.com> <53EB7026.805@gmail.com>
	<53EBE672.7050903@linux.vnet.ibm.com>
	<20140814105802.GD2503@work-vm>
Mime-Version: 1.0 (1.0)
In-Reply-To: <20140814105802.GD2503@work-vm>
Content-Type: text/plain;
	charset=utf-8
Content-Transfer-Encoding: quoted-printable
Message-Id: <059E3E5C-0FD8-4876-AFB7-617EBC52055C@gmail.com>
From: Walid Nouri <walid.nouri@gmail.com>
Date: Tue, 19 Aug 2014 10:33:45 +0200
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State
	consistency
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "michael@hinespot.com" <michael@hinespot.com>, "hinesmr@cn.ibm.com" <hinesmr@cn.ibm.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "Michael R. Hines" <mrhines@linux.vnet.ibm.com>

Hi,
I have tried to find more information on how to use drive-mirror besides what
is available on the wiki. This was not very satisfactory...

This may sound naive, but are there any code examples in C or any other
language, documentation of any kind, developer blog entries, presentation
videos, or any other source of information to get started?

Walid


> On 14.08.2014 at 12:58, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
> cc'ing in a couple of the COLOers.
>
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>>> On 08/13/2014 10:03 PM, Walid Nouri wrote:
>>>
>>> While looking for ideas on how to replicate block devices, I read the
>>> paper about the Remus implementation. I think MC can take a similar
>>> approach for local disk.
>>
>> I agree.
>>
>>> Here are the main facts that I have understood:
>>>
>>> Local disk contents are viewed as internal state of the primary and
>>> secondary. To preserve the disk semantics of the primary and to allow
>>> the primary to run speculatively, all disk state changes are written
>>> directly to the primary's disk and, in parallel, sent asynchronously to
>>> the secondary. The secondary keeps the pending write requests in two
>>> disk buffers: a speculation buffer and a write-out buffer.
>>>
>>> After receiving the next checkpoint, the secondary copies the
>>> speculation buffer to the write-out buffer, commits the checkpoint and
>>> applies the write-out buffer to its local disk.
>>>
>>> When the primary fails, the secondary must wait until the write-out
>>> buffer has been completely written to disk before changing its execution
>>> mode to run as primary. In this case (failure of the primary) the
>>> secondary discards the pending disk writes in its speculation buffer.
>>> This protocol keeps the disk state consistent with the last checkpoint.
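
The secondary-side protocol described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the class
and method names are hypothetical, the "disk" is a plain dict keyed by sector
number, and nothing here is taken from the Remus or QEMU code.

```python
class SecondaryDisk:
    """Toy model of the secondary's two-buffer disk protocol (hypothetical names)."""

    def __init__(self):
        self.disk = {}                # sector -> data; stands in for the local disk
        self.speculation_buffer = []  # writes received since the last checkpoint
        self.write_out_buffer = []    # writes that belong to a committed checkpoint

    def receive_write(self, sector, data):
        # Writes arrive asynchronously from the primary and stay speculative.
        self.speculation_buffer.append((sector, data))

    def commit_checkpoint(self):
        # Checkpoint received: speculative writes become committed writes
        # and are then applied to the local disk.
        self.write_out_buffer.extend(self.speculation_buffer)
        self.speculation_buffer = []
        self.flush()

    def flush(self):
        # Apply committed writes in arrival order.
        for sector, data in self.write_out_buffer:
            self.disk[sector] = data
        self.write_out_buffer = []

    def fail_over(self):
        # Primary failed: drop uncommitted writes, finish writing out the
        # committed ones, and only then take over as primary.
        self.speculation_buffer = []
        self.flush()


secondary = SecondaryDisk()
secondary.receive_write(0, b"A")
secondary.commit_checkpoint()     # b"A" now belongs to the last checkpoint
secondary.receive_write(0, b"B")  # speculative, never committed
secondary.fail_over()
print(secondary.disk)             # {0: b'A'}
```

After the failover the disk holds only what the last committed checkpoint
contained; the uncommitted write of b"B" is gone.
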
>>>
>>> Remus uses the Xen-specific blktap driver. As far as I know this can't be
>>> used with QEMU (KVM).
>>>
>>> I must see how drive-mirror can be used for this kind of protocol.
>>
>> That's all correct. Theoretically, we would do exactly the same thing:
>> drive-mirror on the source would write immediately to disk but follow the
>> same commit semantics on the destination as Xen.
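
For orientation, drive-mirror as it exists today is started through QMP. The
sketch below only builds the JSON messages and does not talk to a running
QEMU. The device name and the NBD target are hypothetical placeholders, and
the stock command has no notion of checkpoints, so the commit semantics
discussed above would still have to be added on top of it.

```python
import json


def drive_mirror_command(device, target, sync="full", fmt="raw", mode="existing"):
    """Build a QMP 'drive-mirror' message (device and target are placeholders)."""
    return {
        "execute": "drive-mirror",
        "arguments": {
            "device": device,  # block device name on the source
            "target": target,  # where the mirrored writes go
            "sync": sync,      # "full", "top" or "none"
            "format": fmt,     # image format of the target
            "mode": mode,      # "existing": do not create the target image
        },
    }


# A QMP session starts with capability negotiation before other commands.
handshake = {"execute": "qmp_capabilities"}

# Hypothetical names: a source drive and an NBD export on the secondary.
command = drive_mirror_command(
    "drive-virtio-disk0",
    "nbd://secondary.example:10809/drive-virtio-disk0",
)

for message in (handshake, command):
    print(json.dumps(message))
```
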
>>
>>>
>>> I have taken a look at COLO.
>>
>>> IMHO there are two points. Custom changes to the TCP stack are a no-go for
>>> proprietary operating systems like Windows. It makes COLO application
>>> agnostic but not operating system agnostic. The other point is that with
>>> I/O intensive workloads COLO will tend to behave like MC. This is my point
>>> of view, but I didn't invest much time to understand everything in detail.
>>
>> Actually, if I remember correctly, the TCP stack is only modified at the
>> hypervisor level - they are intercepting and translating TCP sequence
>> numbers "in-flight" to detect divergence of the source and destination -
>> which is not a big problem if the implementation is well-done.
>
> The 2013 paper says:
>   'COLO modifies the guest OS’s TCP/IP stack in order to make the behavior
>    more deterministic. '
> but does say that an alternative might be to have a
>  ' comparison function that operates transparently over re-assembled TCP
>    streams'
>
>> My hope in the future was that the two approaches could be used in a
>> "Hybrid" manner - actually MC has much more of a performance hit for I/O
>> than COLO does because of its buffering requirements.
>>
>> On the other hand, MC would perform better in a memory-intensive or
>> CPU-intensive situation - so maybe QEMU could "switch" between the two
>> mechanisms at different points in time when the resource bottleneck
>> changes.
>
> If the primary were to rate-limit the number of resynchronisations
> (and send the secondary a message as soon as it knew a resync was needed)
> that would get some of the way, but then the only difference from
> microcheckpointing at that point is the secondary doing a wasteful copy and
> sending the packets across; it seems it should be easy to disable those if
> it knew that a resync was going to happen.
>
> Dave
>
>> - Michael
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK