From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:52803)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mrhines@linux.vnet.ibm.com>) id 1XH0Gx-0002Xp-LW
	for qemu-devel@nongnu.org; Mon, 11 Aug 2014 20:49:16 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mrhines@linux.vnet.ibm.com>) id 1XH0Gl-0003wo-2L
	for qemu-devel@nongnu.org; Mon, 11 Aug 2014 20:49:07 -0400
Received: from e34.co.us.ibm.com ([32.97.110.152]:36528)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mrhines@linux.vnet.ibm.com>) id 1XH0Gk-0003wd-Rj
	for qemu-devel@nongnu.org; Mon, 11 Aug 2014 20:48:54 -0400
Received: from /spool/local
	by e34.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
	Violators will be prosecuted
	for <qemu-devel@nongnu.org> from <mrhines@linux.vnet.ibm.com>;
	Mon, 11 Aug 2014 18:48:54 -0600
Received: from b03cxnp08027.gho.boulder.ibm.com
	(b03cxnp08027.gho.boulder.ibm.com [9.17.130.19])
	by d03dlp02.boulder.ibm.com (Postfix) with ESMTP id EED9A3E4003D
	for <qemu-devel@nongnu.org>; Mon, 11 Aug 2014 18:48:51 -0600 (MDT)
Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170])
	by b03cxnp08027.gho.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with
	ESMTP id s7C0mpGi23265310
	for <qemu-devel@nongnu.org>; Tue, 12 Aug 2014 02:48:51 +0200
Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1])
	by d03av04.boulder.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP
	id s7C0mpgD020075
	for <qemu-devel@nongnu.org>; Mon, 11 Aug 2014 18:48:51 -0600
Message-ID: <53E9247F.4030909@linux.vnet.ibm.com>
Date: Tue, 12 Aug 2014 04:15:59 +0800
From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
MIME-Version: 1.0
References: <53D8FF52.9000104@gmail.com> <1406820870.2680.3.camel@usa>
	<53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
	<53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
	<53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
	<53E8FBBD.7050703@gmail.com>
In-Reply-To: <53E8FBBD.7050703@gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State
	consistency
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Walid Nouri <walid.nouri@gmail.com>, qemu-devel@nongnu.org, michael@hinespot.com, Paolo Bonzini <pbonzini@redhat.com>, hinesmr@cn.ibm.com

Excellent question: QEMU does have a feature called "drive-mirror"
in block/mirror.c that was introduced a couple of years ago. I'm not
sure what the adoption rate of the feature is, but I would start with
that one.
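
For orientation, here is a minimal sketch of the QMP messages that would
start a mirror job. The device name and target path are hypothetical
placeholders, and the helper function is made up for illustration; only
the "qmp_capabilities" and "drive-mirror" commands and their "device",
"target", "format" and "sync" arguments come from QMP itself. Mirroring
alone would presumably still have to be coordinated with checkpoint
boundaries, which this sketch does not attempt.

```python
import json

# Sync modes accepted by this sketch (a subset that QMP's drive-mirror supports).
VALID_SYNC_MODES = ("full", "top", "none")


def drive_mirror_commands(device, target, fmt="qcow2", sync="full"):
    """Build the QMP messages that would start mirroring `device` to `target`.

    Hypothetical helper: it only constructs JSON strings and does not
    talk to a QEMU monitor socket.
    """
    if sync not in VALID_SYNC_MODES:
        raise ValueError("unsupported sync mode: %r" % (sync,))

    # QMP requires a capabilities handshake before any other command.
    handshake = {"execute": "qmp_capabilities"}
    mirror = {
        "execute": "drive-mirror",
        "arguments": {
            "device": device,   # assumed guest drive id, e.g. "virtio0"
            "target": target,   # assumed destination image path
            "format": fmt,
            "sync": sync,
        },
    }
    return [json.dumps(handshake), json.dumps(mirror)]


if __name__ == "__main__":
    for message in drive_mirror_commands("virtio0", "/tmp/mirror.qcow2"):
        print(message)
```

Each string would be sent as one line over the QMP monitor socket, after
reading QEMU's greeting banner.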

There is also a second fault tolerance implementation, called "COLO",
that works a little differently. You may have seen those emails on the
list too. Their method does not require a disk replication solution,
if I recall correctly.

I know the time pressure that comes with a thesis, though =), so
there's no pressure to work on it. Still, the lack of disk replication
in micro-checkpointing is the most pressing issue in the implementation
today.
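
To make the problem concrete, here is a toy model of the scenario from
the quoted thread further down: the primary writes to disk during epoch
n and crashes before checkpoint n reaches the backup. It is only a
sketch under stated assumptions, not the MC code. All class and function
names are hypothetical, the "disk" is a plain dict, and "buffered" mode
stands in for any scheme that holds disk writes back until the
checkpoint commits.

```python
class Backup:
    """State visible to the secondary after a failover."""

    def __init__(self):
        self.memory_epoch = 0   # last checkpoint that was fully received
        self.disk = {}          # block number -> (epoch, payload)


class Primary:
    def __init__(self, backup, buffer_disk_writes):
        self.backup = backup
        self.buffer_disk_writes = buffer_disk_writes
        self.epoch = 1
        self.pending = []       # disk writes held back until the checkpoint

    def write(self, block, payload):
        record = (self.epoch, payload)
        if self.buffer_disk_writes:
            self.pending.append((block, record))
        else:
            # Shared disk: the write is visible to the backup immediately.
            self.backup.disk[block] = record

    def checkpoint(self):
        # Release held-back writes together with the memory/VCPU state.
        for block, record in self.pending:
            self.backup.disk[block] = record
        self.pending.clear()
        self.backup.memory_epoch = self.epoch
        self.epoch += 1


def disk_is_consistent(backup):
    """True if no disk block is newer than the backup's memory state."""
    return all(epoch <= backup.memory_epoch
               for epoch, _payload in backup.disk.values())


def run_scenario(buffer_disk_writes):
    backup = Backup()
    primary = Primary(backup, buffer_disk_writes)
    primary.write(0, "a")
    primary.checkpoint()        # epoch 1 is committed on the backup
    primary.write(1, "b")       # written during epoch 2 ...
    # ... and the primary crashes here, before checkpoint 2 is sent.
    return disk_is_consistent(backup)


if __name__ == "__main__":
    print("direct writes consistent:  ", run_scenario(False))
    print("buffered writes consistent:", run_scenario(True))
```

With direct writes the backup ends up with a disk block from epoch 2
next to memory from epoch 1; with writes held back until the checkpoint,
disk and memory stay at the same epoch.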

The MC implementation also needs to be rebased against the latest
master. I just haven't had a chance to do it yet because some of my
hardware has been taken away from me over the last few months. I will
see if I can find some reasonable hardware soon.

- Michael

On 08/12/2014 01:22 AM, Walid Nouri wrote:
> Hi,
> I will do my best to make a contribution :-)
>
> Are there alternative ways of replicating local storage, other than
> DRBD, that might be feasible?
> Perhaps some that are built directly into QEMU?
>
> Walid
>
> On 09.08.2014 14:25, Michael R. Hines wrote:
>> On Sat, 2014-08-09 at 14:08 +0200, Walid Nouri wrote:
>>> Hi Michael,
>>> how is the weather in Beijing? :-)
>> It's terrible. Lots of pollution =(
>>
>>> May I ask you some questions about your MC implementation?
>>>
>>> Currently I'm trying to understand how the MC protocol works in
>>> general, and what problems can occur, so that I can discuss it
>>> in my thesis.
>>>
>>> As far as I understand, MC relies on a shared disk. Disk output of the
>>> primary VM is written directly, while network output is buffered until
>>> the corresponding checkpoint is acknowledged.
>>>
>>> One problem that comes to mind is: what happens when the primary VM
>>> writes to the disk and crashes before sending the corresponding
>>> checkpoint?
>>>
>> The MC implementation itself is incomplete today. (I need help.)
>>
>> The Xen Remus implementation uses the DRBD system to "mirror" all disk
>> writes to the source and destination before completing each checkpoint.
>>
>> The KVM (MC) implementation needs exactly the same support, but it is
>> missing today.
>>
>> Until that happens, we are *required* to use root-over-iSCSI or
>> root-over-NFS (meaning that the guest filesystem is mounted directly
>> inside the virtual machine without the host knowing about it).
>>
>> This has the effect of translating all disk I/O into network I/O,
>> and since network I/O is already buffered, we are safe.
>>
>>
>>> Here is an example: the primary's state is in the current epoch (n),
>>> and the secondary's state is in epoch (n-1). The primary writes to disk
>>> and crashes before or while sending checkpoint n. In this case the
>>> secondary's memory state is still at epoch (n-1), while the state of
>>> the shared disk corresponds to the primary's state at epoch (n).
>>>
>>> How does MC guarantee that the disk state of the backup VM is
>>> consistent with its memory state?
>> As I mentioned above, we need the equivalent of the Xen solution, but I
>> just haven't had the time to write it (or incorporate someone else's
>> implementation). Patch is welcome =)
>>
>>> Is memory-VCPU / disk state consistency necessary under all
>>> circumstances?
>>> Or can it be neglected because, after a failover, the secondary will
>>> repeat the same instructions and eventually write the same data to disk
>>> a second time (as the primary did before)?
>>> Could this lead to fatal inconsistencies?
>>>
>>> Walid
>>>
>>
>>
>>
>> - Michael
>>
>>
>>
>
>