Message-ID: <53E9247F.4030909@linux.vnet.ibm.com>
Date: Tue, 12 Aug 2014 04:15:59 +0800
From: "Michael R. Hines"
References: <53D8FF52.9000104@gmail.com> <1406820870.2680.3.camel@usa>
 <53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
 <53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
 <53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
 <53E8FBBD.7050703@gmail.com>
In-Reply-To: <53E8FBBD.7050703@gmail.com>
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
To: Walid Nouri, qemu-devel@nongnu.org, michael@hinespot.com,
 Paolo Bonzini, hinesmr@cn.ibm.com

Excellent question: QEMU does have a feature called "drive-mirror" in
block/mirror.c that was introduced a couple of years ago. I'm not sure
what the adoption rate of the feature is, but I would start with that
one.

There is also a second fault tolerance implementation called "COLO"
that works a little differently - you may have seen those emails on
the list too - but their method does not require a disk replication
solution, if I recall correctly.

I know the time pressure that comes during a thesis, though =), so
there's no pressure to work on it - but the lack of disk replication
in micro-checkpointing is the most pressing issue in the
implementation today.

The MC implementation also needs to be rebased against the latest
master - I just haven't had a chance to do it yet because some of my
hardware has been taken away from me over the last few months - will
see if I can find some reasonable hardware soon.

- Michael

On 08/12/2014 01:22 AM, Walid Nouri wrote:
> Hi,
> I will do my best to make a contribution :-)
>
> Are there feasible alternatives to DRBD for replicating local
> storage? Perhaps some that are built directly into QEMU?
>
> Walid
>
> On 09.08.2014 at 14:25, Michael R. Hines wrote:
>> On Sat, 2014-08-09 at 14:08 +0200, Walid Nouri wrote:
>>> Hi Michael,
>>> how is the weather in Beijing? :-)
>> It's terrible. Lots of pollution =(
>>
>>> May I ask you some questions about your MC implementation?
>>>
>>> Currently I'm trying to understand the general workings of the MC
>>> protocol and the problems that can occur, so that I can discuss
>>> them in my thesis.
>>>
>>> As far as I have understood, MC relies on a shared disk. Disk
>>> output of the primary VM is written directly, while network output
>>> is buffered until the corresponding checkpoint is acknowledged.
>>>
>>> One problem that comes to mind is: what happens when the primary
>>> VM writes to the disk and crashes before sending the corresponding
>>> checkpoint?
>>>
>> The MC implementation itself is incomplete today. (I need help.)
>>
>> The Xen Remus implementation uses the DRBD system to "mirror" all
>> disk writes to the source and destination before completing each
>> checkpoint.
>>
>> The KVM (MC) implementation needs exactly the same support, but it
>> is missing today.
>>
>> Until that happens, we are *required* to use root-over-iSCSI or
>> root-over-NFS (meaning that the guest filesystem is mounted directly
>> inside the virtual machine without the host knowing about it).
>>
>> This has the effect of translating all disk I/O into network I/O,
>> and since network I/O is already buffered, we are safe.
>>
>>> Here is an example: the primary's state is in the current epoch
>>> (n), and the secondary's state is in epoch (n-1). The primary
>>> writes to disk and crashes before or while sending checkpoint n.
>>> In this case the secondary's memory state is still at epoch (n-1),
>>> while the state of the shared disk corresponds to the primary's
>>> state at epoch (n).
>>>
>>> How does MC guarantee that the disk state of the backup VM is
>>> consistent with its memory state?
>> As I mentioned above, we need the equivalent of the Xen solution,
>> but I just haven't had the time to write it (or incorporate someone
>> else's implementation). Patches are welcome =)
>>
>>> Is Memory-VCPU / Disk State consistency necessary under all
>>> circumstances?
>>> Or can it be neglected because the secondary will (after a
>>> failover) repeat the same instructions and eventually write the
>>> same data to disk a second time (as the primary did before)?
>>> Could this lead to fatal inconsistencies?
>>>
>>> Walid
>>>
>>
>> - Michael
>>
>
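
To make the scenario from this thread concrete, here is a minimal toy
model in Python of the Remus-style idea described above: disk writes
mirrored to the secondary are held back until the checkpoint for their
epoch is committed, and are dropped on failover. It is only a sketch
under stated assumptions, not QEMU, DRBD, or MC code; the names
`BufferedReplicaDisk`, `mirror_write`, `commit_checkpoint`, `failover`
and `demo` are all hypothetical.

```python
"""Toy model of epoch-based disk write buffering (not QEMU or DRBD code)."""


class BufferedReplicaDisk:
    """Hypothetical secondary-side disk: writes apply only at checkpoint commit."""

    def __init__(self):
        self.blocks = {}    # committed disk state: sector -> data
        self.pending = []   # mirrored writes of the epoch still in flight
        self.epoch = 0      # last epoch whose checkpoint was committed

    def mirror_write(self, sector, data):
        # A write mirrored from the primary is queued, not applied.
        self.pending.append((sector, data))

    def commit_checkpoint(self, epoch):
        # Memory/VCPU state for `epoch` has arrived, so the queued writes
        # now belong to a complete checkpoint and can be applied in order.
        for sector, data in self.pending:
            self.blocks[sector] = data
        self.pending.clear()
        self.epoch = epoch

    def failover(self):
        # The primary died mid-epoch: discard writes without a checkpoint
        # and return the epoch and disk state the secondary resumes from.
        self.pending.clear()
        return self.epoch, dict(self.blocks)


def demo():
    shared_disk = {}                 # unbuffered shared disk, for contrast
    replica = BufferedReplicaDisk()

    # Epoch 1: the write happens and checkpoint 1 is committed.
    shared_disk[7] = "A"
    replica.mirror_write(7, "A")
    replica.commit_checkpoint(1)

    # Epoch 2: the write happens, then the primary crashes before
    # checkpoint 2 is sent.
    shared_disk[7] = "B"
    replica.mirror_write(7, "B")

    epoch, blocks = replica.failover()
    return epoch, blocks, shared_disk


if __name__ == "__main__":
    print(demo())
```

In this model the secondary resumes at epoch 1 with sector 7 still
holding the epoch-1 data, whereas the unbuffered shared disk already
holds the epoch-2 write, which is exactly the memory/disk mismatch
asked about above.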