From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50330)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1XHsjk-0001Bp-Cv
	for qemu-devel@nongnu.org; Thu, 14 Aug 2014 06:58:35 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1XHsje-0000GC-6O
	for qemu-devel@nongnu.org; Thu, 14 Aug 2014 06:58:28 -0400
Received: from mx1.redhat.com ([209.132.183.28]:13150)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1XHsjd-0000G4-V2
	for qemu-devel@nongnu.org; Thu, 14 Aug 2014 06:58:22 -0400
Date: Thu, 14 Aug 2014 11:58:03 +0100
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Message-ID: <20140814105802.GD2503@work-vm>
References: <53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
	<53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
	<53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
	<53E8FBBD.7050703@gmail.com> <53E9247F.4030909@linux.vnet.ibm.com>
	<53EB7026.805@gmail.com> <53EBE672.7050903@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <53EBE672.7050903@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State
 consistency
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
Cc: Walid Nouri <walid.nouri@gmail.com>, hinesmr@cn.ibm.com, qemu-devel@nongnu.org, michael@hinespot.com

cc'ing in a couple of the COLOers.

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 08/13/2014 10:03 PM, Walid Nouri wrote:
> >
> >While looking to find some ideas for approaches to replicating block
> >devices I have read the paper about the Remus implementation. I think MC
> >can take a similar approach for local disk.
> >
>
> I agree.
>
> >Here are the main facts that I have understood:
> >
> >Local disk contents are viewed as internal state of the primary and
> >secondary. In the explanation they describe that, to keep the disk
> >semantics of the primary and to allow the primary to run speculatively,
> >all disk state changes are written directly to the disk. In parallel they
> >are sent asynchronously to the secondary. The secondary keeps the pending
> >write requests in two disk buffers: a speculation buffer and a write-out
> >buffer.
> >
> >After the reception of the next checkpoint the secondary copies the
> >speculation buffer to the write-out buffer, commits the checkpoint and
> >applies the write-out buffer to its local disk.
> >
> >When the primary fails the secondary must wait until the write-out buffer
> >has been completely written to disk before changing the execution mode
> >to run as primary. In this case (failure of primary) the secondary
> >discards pending disk writes in its speculation buffer. This protocol
> >keeps the disk state consistent with the last checkpoint.
> >
> >Remus uses the Xen-specific blktap driver. As far as I know this can't be
> >used with QEMU (KVM).
> >
> >I must see how drive-mirror can be used for this kind of protocol.
> >
>
> That's all correct. Theoretically, we would do exactly the same thing:
> drive-mirror on the source would write immediately to disk but follow the
> same commit semantics on the destination as Xen.
>
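
As a rough illustration of the secondary-side buffering described above,
here is a minimal Python sketch. It is not Remus or QEMU code: the class and
method names are made up for illustration, and the "disk" is just a dict
mapping sector numbers to data.

```python
class SecondaryDiskBuffer:
    """Hypothetical model of the secondary's two disk buffers."""

    def __init__(self, disk):
        self.disk = disk        # dict: sector -> data, stands in for the local disk
        self.speculation = []   # writes received since the last checkpoint
        self.write_out = []     # writes that belong to a committed checkpoint

    def receive_write(self, sector, data):
        # Writes from the primary arrive asynchronously and are only buffered.
        self.speculation.append((sector, data))

    def checkpoint(self):
        # Checkpoint received: speculative writes become committed writes.
        self.write_out.extend(self.speculation)
        self.speculation = []

    def flush(self):
        # Apply all committed writes to the local disk, in arrival order.
        for sector, data in self.write_out:
            self.disk[sector] = data
        self.write_out = []

    def failover(self):
        # Primary failed: finish the write-out, drop uncommitted writes.
        self.flush()
        self.speculation = []
```

With this split, only writes covered by a received checkpoint ever reach the
secondary's disk, so after failover() the disk matches the last checkpoint.
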
> >
> >I have taken a look at COLO.
> >
>
> >IMHO there are two points. Custom changes to the TCP stack are a no-go for
> >proprietary operating systems like Windows. It makes COLO application
> >agnostic but not operating system agnostic. The other point is that with
> >I/O intensive workloads COLO will tend to behave like MC. This is my point
> >of view but I didn't invest much time to understand everything in detail.
> >
>
> Actually, if I remember correctly, the TCP stack is only modified at the
> hypervisor level - they are intercepting and translating TCP sequence
> numbers "in-flight" to detect divergence of the source and destination -
> which is not a big problem if the implementation is well-done.

The 2013 paper says:
   'COLO modifies the guest OS’s TCP/IP stack in order to make the behavior
    more deterministic.'
but does say that an alternative might be to have a
  'comparison function that operates transparently over re-assembled
   TCP streams'
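
To make that alternative concrete, here is a minimal Python sketch of a
comparison over re-assembled streams. It is only an illustration, not what
COLO does: the function names are hypothetical, segments are assumed to be
(sequence number, payload) pairs, and sequence number wraparound is ignored.

```python
def reassemble(segments, isn):
    """Rebuild the contiguous byte stream from (seq, payload) segments.

    isn is the sequence number of the first payload byte (an assumption
    of this sketch). Retransmitted or overlapping data is skipped, and
    reassembly stops at the first gap.
    """
    stream = bytearray()
    for seq, payload in sorted(segments, key=lambda s: s[0]):
        offset = seq - isn
        if offset > len(stream):
            break  # gap: later data cannot be compared yet
        # Append only the bytes that extend what we already have.
        stream.extend(payload[len(stream) - offset:])
    return bytes(stream)


def streams_diverged(primary_segments, secondary_segments,
                     primary_isn, secondary_isn):
    """Compare the two outputs byte for byte over their common prefix."""
    a = reassemble(primary_segments, primary_isn)
    b = reassemble(secondary_segments, secondary_isn)
    n = min(len(a), len(b))  # one side may simply be ahead of the other
    return a[:n] != b[:n]
```

Comparing payload bytes rather than packets means the two guests can use
different sequence numbers and segment boundaries without that counting as
divergence.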

> My hope in the future was that the two approaches could be used in a
> "Hybrid" manner - actually MC has much more of a performance hit for I/O
> than COLO does because of its buffering requirements.
>
> On the other hand, MC would perform better in a memory-intensive or
> CPU-intensive situation - so maybe QEMU could "switch" between the two
> mechanisms at different points in time when the resource bottleneck changes.

If the primary were to rate-limit the number of resynchronisations
(and send the secondary a message as soon as it knew a resync was needed)
that would get some of the way, but then the only difference from
microcheckpointing at that point is the secondary doing a wasteful copy and
sending the packets across; it seems it should be easy to disable those if
it knew that a resync was going to happen.
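
Something like the following Python sketch is what I have in mind for the
primary side. Everything in it is hypothetical (the class, the callback and
the interval parameter are not existing QEMU or COLO interfaces), and time is
passed in explicitly to keep it simple.

```python
class ResyncLimiter:
    """Hypothetical primary-side rate limiter for resynchronisations."""

    def __init__(self, min_interval, notify_secondary):
        self.min_interval = min_interval          # seconds between resyncs
        self.notify_secondary = notify_secondary  # callback: warn the secondary
        self.last_resync = None
        self.pending = False

    def divergence_detected(self):
        # Tell the secondary straight away, even if the resync itself is
        # deferred, so it can stop copying and sending packets.
        if not self.pending:
            self.pending = True
            self.notify_secondary()

    def should_resync(self, now):
        # Allow a resync only if one is pending and the interval has passed.
        if not self.pending:
            return False
        if (self.last_resync is not None
                and now - self.last_resync < self.min_interval):
            return False
        self.last_resync = now
        self.pending = False
        return True
```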

Dave

> - Michael
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK