Date: Thu, 14 Aug 2014 11:58:03 +0100
From: "Dr. David Alan Gilbert"
To: "Michael R. Hines"
Cc: Walid Nouri, hinesmr@cn.ibm.com, qemu-devel@nongnu.org, michael@hinespot.com
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Message-ID: <20140814105802.GD2503@work-vm>
In-Reply-To: <53EBE672.7050903@linux.vnet.ibm.com>

cc'ing in a couple of the COLOers.

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 08/13/2014 10:03 PM, Walid Nouri wrote:
> >
> >While looking to find some ideas for approaches to replicating block
> >devices I have read the paper about the Remus implementation. I think MC
> >can take a similar approach for local disk.
> >
>
> I agree.
>
> >Here are the main facts that I have understood:
> >
> >Local disk contents are viewed as internal state of the primary and
> >secondary.
> >In the explanation they describe that, to keep the disk semantics of the
> >primary and to allow the primary to run speculatively, all disk state
> >changes are written directly to the disk. In parallel they are sent
> >asynchronously to the secondary. The secondary keeps the pending write
> >requests in two disk buffers: a speculation-disk-buffer and a
> >write-out-buffer.
> >
> >After the reception of the next checkpoint the secondary copies the
> >speculation buffer to the write-out buffer, commits the checkpoint and
> >applies the write-out buffer to its local disk.
> >
> >When the primary fails the secondary must wait until the write-out-buffer
> >has been completely written to disk before changing the execution mode
> >to run as primary. In this case (failure of primary) the secondary
> >discards pending disk writes in its speculation buffer. This protocol
> >keeps the disk state consistent with the last checkpoint.
> >
> >Remus uses the Xen-specific blktap driver. As far as I know this can't be
> >used with QEMU (KVM).
> >
> >I must see how drive-mirror can be used for this kind of protocol.
> >
>
> That's all correct. Theoretically, we would do exactly the same thing:
> drive-mirror on the source would write immediately to disk but follow the
> same commit semantics on the destination as Xen.
>
> >
> >I have taken a look at COLO.
> >
>
> >IMHO there are two points. Custom changes of the TCP stack are a no-go for
> >proprietary operating systems like Windows. It makes COLO application
> >agnostic but not operating system agnostic. The other point is that with
> >I/O intensive workloads COLO will tend to behave like MC. This is my point
> >of view but I didn't invest much time to understand everything in detail.
> >
>
> Actually, if I remember correctly, the TCP stack is only modified at the
> hypervisor level - they are intercepting and translating TCP sequence
> numbers "in-flight" to detect divergence of the source and destination -
> which is not a big problem if the implementation is well-done.

The 2013 paper says:
   'COLO modifies the guest OS’s TCP/IP stack in order to make the behavior
    more deterministic.'
but does say that an alternative might be to have a
   'comparison function that operates transparently over re-assembled TCP
    streams'

> My hope in the future was that the two approaches could be used in a
> "Hybrid" manner - actually MC has much more of a performance hit for I/O
> than COLO does because of its buffering requirements.
>
> On the other hand, MC would perform better in a memory-intensive or
> CPU-intensive situation - so maybe QEMU could "switch" between the two
> mechanisms at different points in time when the resource bottleneck changes.

If the primary were to rate-limit the number of resynchronisations
(and send the secondary a message as soon as it knew a resync was needed) that
would get some of the way, but then the only difference from microcheckpointing
at that point is the secondary doing a wasteful copy and sending the packets
across; it seems it should be easy to disable those if it knew that a resync
was going to happen.

Dave

> - Michael
>
>
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
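
For reference, the Remus-style disk buffering summarised in the quoted text (a speculation buffer and a write-out buffer on the secondary) could be modelled roughly as below. This is only a sketch under assumptions: the class name `SecondaryDisk`, its method names, and the dict standing in for the local disk are all hypothetical, and nothing here is QEMU, Remus, or drive-mirror code.

```python
from dataclasses import dataclass, field


@dataclass
class SecondaryDisk:
    """Hypothetical model of the secondary's disk state (illustration only)."""
    disk: dict = field(default_factory=dict)         # sector -> data, state on local disk
    speculation: list = field(default_factory=list)  # writes received since the last checkpoint
    write_out: list = field(default_factory=list)    # writes of a committed checkpoint, not yet on disk

    def receive_write(self, sector, data):
        # Writes arrive asynchronously from the primary and are only buffered.
        self.speculation.append((sector, data))

    def checkpoint(self):
        # Next checkpoint received: move the speculative writes to the
        # write-out buffer, then apply that buffer to the local disk.
        self.write_out.extend(self.speculation)
        self.speculation.clear()
        self.flush()

    def flush(self):
        # Apply committed writes in arrival order.
        for sector, data in self.write_out:
            self.disk[sector] = data
        self.write_out.clear()

    def failover(self):
        # Primary failed: finish writing out committed data first, then
        # discard writes that belong to no committed checkpoint.
        self.flush()
        self.speculation.clear()
        return self.disk
```

In this model, a write that arrives after the last checkpoint never reaches `disk` if `failover()` is called, which is the "consistent with the last checkpoint" property described in the quoted text.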