From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Date: Mon, 26 Apr 2010 19:44:11 +0900
Message-ID: <4BD56E7B.30906@lab.ntt.co.jp>
References: <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <4BD00FBA.5040604@redhat.com> <4BD02684.8060202@lab.ntt.co.jp> <4BD03ED8.707@redhat.com> <x2y87e9effc1004220616jdee0e344n634c67ab5565d755@mail.gmail.com> <4BD0B295.3010004@codemonkey.ws> <4BD0FD8F.5060108@lab.ntt.co.jp> <4BD19E95.7030906@linux.vnet.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: dlaor@redhat.com, ohmura.kei@lab.ntt.co.jp, kvm@vger.kernel.org,
	mtosatti@redhat.com, qemu-devel@nongnu.org,
	yoshikawa.takuya@oss.ntt.co.jp, avi@redhat.com
To: Anthony Liguori <aliguori@linux.vnet.ibm.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from tama50.ecl.ntt.co.jp ([129.60.39.147]:54421 "EHLO
	tama50.ecl.ntt.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752261Ab0DZKoX (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 26 Apr 2010 06:44:23 -0400
In-Reply-To: <4BD19E95.7030906@linux.vnet.ibm.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Anthony Liguori wrote:
> On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>>> Dor Laor wrote:
>>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have been implementing the prototype of Kemari for KVM, and
>>>>>>>> we're
>>>>>>>> sending
>>>>>>>> this message to share what we have now and TODO lists.
>>>>>>>> Hopefully, we
>>>>>>>> would like
>>>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>>>> advanced
>>>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>>>> this project
>>>>>>>> step by step while absorbing comments from the community. The
>>>>>>>> current
>>>>>>>> code is
>>>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>>
>>>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>>>> following RFC which we posted last year.
>>>>>>>>
>>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>>>
>>>>>>>> The transmission/transaction protocol, and most of the control
>>>>>>>> logic is
>>>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent
>>>>>>>> rip
>>>>>>>> from
>>>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>>>> plumbing in
>>>>>>>> the
>>>>>>>> kernel side to guarantee replayability of certain events and
>>>>>>>> instructions,
>>>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>>>> stack, as well
>>>>>>>> as for optimization purposes, for example.
>>>>>>> [snip]
>>>>>>>
>>>>>>>> The rest of this message describes TODO lists grouped by each
>>>>>>>> topic.
>>>>>>>>
>>>>>>>> === event tapping ===
>>>>>>>>
>>>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>>>> which
>>>>>>>> event the
>>>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>>>> here is
>>>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>>>> for
>>>>>>>> disk I/O
>>>>>>>> and reliable network protocols such as TCP.
>>>>>>> IMO any type of network event should be stalled too. What if the VM
>>>>>>> runs
>>>>>>> non tcp protocol and the packet that the master node sent reached
>>>>>>> some
>>>>>>> remote client and before the sync to the slave the master failed?
>>>>>> In current implementation, it is actually stalling any type of
>>>>>> network
>>>>>> that goes through virtio-net.
>>>>>>
>>>>>> However, if the application was using unreliable protocols, it should
>>>>>> have its own recovering mechanism, or it should be completely
>>>>>> stateless.
>>>>> Why do you treat tcp differently? You can damage the entire VM this
>>>>> way -
>>>>> think of dhcp request that was dropped on the moment you switched
>>>>> between
>>>>> the master and the slave?
>>>> I'm not trying to say that we should treat tcp differently, but just
>>>> it's severe.
>>>> In case of dhcp request, the client would have a chance to retry after
>>>> failover, correct?
>>>> BTW, in current implementation,
>>>
>>> I'm slightly confused about the current implementation vs. my
>>> recollection of the original paper with Xen. I had thought that all disk
>>> and network I/O was buffered in such a way that at each checkpoint, the
>>> I/O operations would be released in a burst. Otherwise, you would have
>>> to synchronize after every I/O operation which is what it seems the
>>> current implementation does.
>>
>> Yes, you're almost right.
>> It's synchronizing before QEMU starts emulating I/O at each device model.
>
> If NodeA is the master and NodeB is the slave, if NodeA sends a network
> packet, you'll checkpoint before the packet is actually sent, and then
> if a failure occurs before the next checkpoint, won't that result in
> both NodeA and NodeB sending out a duplicate version of the packet?

Yes. But I think that is better than taking the checkpoint afterwards.

Suppose we checkpoint after sending the packet, and the packet is a TCP ACK to
the client. If a hardware failure hits NodeA during the transaction *but the
client has received the TCP ACK*, NodeB will resume from the previous state,
in which it may still need to receive some data from the client. However,
because the client has already received the TCP ACK, it won't resend the data
to NodeB. It looks like this data is going to be lost.
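
To make the two orderings concrete, here is a toy sketch in Python. It is not
Kemari code: the function simulate_failover and the strings "seg1" and
"ack(seg1)" are made up for illustration, and it assumes a single segment with
NodeA failing right after the ACK leaves the host.

```python
def simulate_failover(sync_before_output: bool) -> dict:
    """Toy model: one client segment, primary fails right after emitting the ACK."""
    primary_data = []      # segments the guest on the primary (NodeA) has received
    secondary_data = []    # guest state on the secondary (NodeB) at the last checkpoint
    acks_at_client = []    # ACKs that actually reached the client

    # 1. The client sends a segment and the guest on the primary receives it.
    primary_data.append("seg1")

    # 2. The guest emits an ACK; the sync point is either before or after it.
    if sync_before_output:
        secondary_data = list(primary_data)   # checkpoint first
        acks_at_client.append("ack(seg1)")    # then the ACK leaves the primary
    else:
        acks_at_client.append("ack(seg1)")    # the ACK leaves the primary first
        # The primary fails here, before the checkpoint could be taken.

    # 3. The primary fails; the secondary resumes from its last checkpoint.
    if sync_before_output:
        # The checkpointed guest is about to emit the ACK, so it emits it again.
        acks_at_client.append("ack(seg1)")

    # The client only retransmits a segment it has not seen acknowledged.
    client_will_resend = "ack(seg1)" not in acks_at_client
    return {
        "duplicate_acks": acks_at_client.count("ack(seg1)") - 1,
        "data_lost": "seg1" not in secondary_data and not client_will_resend,
    }


if __name__ == "__main__":
    print("sync before output:", simulate_failover(True))
    print("sync after output: ", simulate_failover(False))
```

In this model, syncing before the output costs a duplicate ACK, which TCP
tolerates, while syncing after the output leaves NodeB without data that the
client believes was delivered.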

Anyway, I've just started planning to move the sync point to the network/block
layer, and I will post the result for discussion again.