Message-ID: <53EBE672.7050903@linux.vnet.ibm.com>
Date: Thu, 14 Aug 2014 06:28:02 +0800
From: "Michael R. Hines"
To: Walid Nouri, qemu-devel@nongnu.org, michael@hinespot.com, hinesmr@cn.ibm.com
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
In-Reply-To: <53EB7026.805@gmail.com>
References: <53D8FF52.9000104@gmail.com> <1406820870.2680.3.camel@usa>
 <53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
 <53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
 <53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
 <53E8FBBD.7050703@gmail.com> <53E9247F.4030909@linux.vnet.ibm.com>
 <53EB7026.805@gmail.com>

On 08/13/2014 10:03 PM, Walid Nouri wrote:
>
> While looking for ideas on how to replicate block devices, I read
> the paper about the Remus implementation. I think MC can take a
> similar approach for local disk.
>

I agree.

> Here are the main facts that I have understood:
>
> Local disk contents are viewed as internal state of the primary and
> secondary.
> In the explanation they describe that, to keep the disk semantics of
> the primary and to allow the primary to run speculatively, all disk
> state changes are written directly to the disk. In parallel, they are
> sent asynchronously to the secondary. The secondary keeps the pending
> write requests in two disk buffers: a speculation-disk-buffer and a
> write-out-buffer.
>
> After the reception of the next checkpoint the secondary copies the
> speculation buffer to the write-out buffer, commits the checkpoint and
> applies the write-out buffer to its local disk.
>
> When the primary fails the secondary must wait until the
> write-out-buffer has been completely written to disk before changing
> the execution mode to run as primary. In this case (failure of
> primary) the secondary discards pending disk writes in its
> speculation buffer.
> This protocol keeps the disk state consistent with the last checkpoint.
>
> Remus uses the Xen-specific blktap driver. As far as I know this can’t
> be used with QEMU (KVM).
>
> I must see how drive-mirror can be used for this kind of protocol.
>

That's all correct. Theoretically, we would do exactly the same thing:
drive-mirror on the source would write immediately to disk but follow
the same commit semantics on the destination as Remus does on Xen.

>
> I have taken a look at COLO.
>
> IMHO there are two points. Custom changes of the TCP stack are a no-go
> for proprietary operating systems like Windows. It makes COLO
> application agnostic but not operating system agnostic. The other
> point is that with I/O intensive workloads COLO will tend to behave
> like MC. This is my point of view but I didn’t invest much time to
> understand everything in detail.
>

Actually, if I remember correctly, the TCP stack is only modified at
the hypervisor level - they are intercepting and translating TCP
sequence numbers "in-flight" to detect divergence of the source and
destination - which is not a big problem if the implementation is
done well.

My hope was that, in the future, the two approaches could be used in a
"hybrid" manner. MC actually takes a much larger performance hit for
I/O than COLO does because of its buffering requirements. On the other
hand, MC would perform better in a memory-intensive or CPU-intensive
situation - so maybe QEMU could "switch" between the two mechanisms at
different points in time when the resource bottleneck changes.

- Michael
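
For reference, the secondary-side buffering described in the quoted
text can be sketched roughly as follows. This is only a minimal
illustration of the two-buffer commit semantics, not Remus or QEMU
code: the class name SecondaryDisk, its method names, and the
dict-as-disk model are all hypothetical, and real block I/O, ordering
and error handling are left out.

```python
from dataclasses import dataclass, field


@dataclass
class SecondaryDisk:
    """Hypothetical model of the secondary's disk state (sector -> data)."""

    disk: dict = field(default_factory=dict)
    # Writes received since the last checkpoint; not yet safe to apply.
    speculation: list = field(default_factory=list)
    # Writes covered by a committed checkpoint; waiting to hit the disk.
    write_out: list = field(default_factory=list)

    def receive_write(self, sector, data):
        # The primary already wrote this locally; the secondary only buffers it.
        self.speculation.append((sector, data))

    def receive_checkpoint(self):
        # Checkpoint arrived: move speculative writes to the write-out
        # buffer, then apply them to the local disk.
        self.write_out.extend(self.speculation)
        self.speculation.clear()
        self.flush()

    def flush(self):
        # Apply committed writes in the order they were received.
        for sector, data in self.write_out:
            self.disk[sector] = data
        self.write_out.clear()

    def failover(self):
        # Primary failed: finish committed writes first, then drop
        # everything that was never covered by a checkpoint.
        self.flush()
        self.speculation.clear()


secondary = SecondaryDisk()
secondary.receive_write(0, b"a")
secondary.receive_checkpoint()    # sector 0 is now committed
secondary.receive_write(0, b"b")  # speculative, no checkpoint yet
secondary.failover()              # speculative write is discarded
print(secondary.disk)
```

After failover the model's disk only contains writes that belonged to
a received checkpoint, which is the consistency property the protocol
is meant to give.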