From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yoshiaki Tamura
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Date: Fri, 23 Apr 2010 10:53:19 +0900
Message-ID: <4BD0FD8F.5060108@lab.ntt.co.jp>
References: <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
 <4BD00FBA.5040604@redhat.com> <4BD02684.8060202@lab.ntt.co.jp>
 <4BD03ED8.707@redhat.com> <4BD0B295.3010004@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: dlaor@redhat.com, ohmura.kei@lab.ntt.co.jp, kvm@vger.kernel.org,
 mtosatti@redhat.com, aliguori@us.ibm.com, qemu-devel@nongnu.org,
 yoshikawa.takuya@oss.ntt.co.jp, avi@redhat.com
To: Anthony Liguori
Return-path: 
Received: from tama50.ecl.ntt.co.jp ([129.60.39.147]:60353 "EHLO
 tama50.ecl.ntt.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with
 ESMTP id S1752760Ab0DWBxm (ORCPT ); Thu, 22 Apr 2010 21:53:42 -0400
In-Reply-To: <4BD0B295.3010004@codemonkey.ws>
Sender: kvm-owner@vger.kernel.org
List-ID: 

Anthony Liguori wrote:
> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>> 2010/4/22 Dor Laor:
>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>> Dor Laor wrote:
>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>>> sending this message to share what we have now and our TODO lists.
>>>>>> We hope to get early feedback to keep us headed in the right
>>>>>> direction. Although the advanced approaches in the TODO lists are
>>>>>> fascinating, we would like to run this project step by step while
>>>>>> absorbing comments from the community. The current code is based on
>>>>>> qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>
>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>> following RFC, which we posted last year.
>>>>>>
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>
>>>>>> The transmission/transaction protocol and most of the control logic
>>>>>> are implemented in QEMU. However, we needed a hack in KVM to prevent
>>>>>> rip from proceeding before synchronizing the VMs. It may also need
>>>>>> some plumbing on the kernel side to guarantee replayability of
>>>>>> certain events and instructions, to integrate the RAS capabilities
>>>>>> of newer x86 hardware with the HA stack, and for optimization
>>>>>> purposes, for example.
>>>>> [snip]
>>>>>
>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>
>>>>>> === event tapping ===
>>>>>>
>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>> which event the primary should synchronize with the secondary. The
>>>>>> basic assumption here is that outgoing I/O operations are
>>>>>> idempotent, which is usually true for disk I/O and reliable network
>>>>>> protocols such as TCP.
>>>>> IMO any type of network event should be stalled too. What if the VM
>>>>> runs a non-TCP protocol, and a packet the master node sent reached
>>>>> some remote client, but the master failed before syncing to the
>>>>> slave?
>>>> The current implementation actually stalls any type of network
>>>> traffic that goes through virtio-net.
>>>>
>>>> However, if the application is using an unreliable protocol, it
>>>> should have its own recovery mechanism, or it should be completely
>>>> stateless.
>>> Why do you treat TCP differently? You can damage the entire VM this
>>> way - think of a DHCP request that was dropped at the moment you
>>> switched between the master and the slave.
>> I'm not trying to say that we should treat TCP differently, just that
>> it's more severe. In the case of a DHCP request, the client would have
>> a chance to retry after failover, correct?
>> BTW, in current implementation,

> I'm slightly confused about the current implementation vs. my
> recollection of the original paper with Xen. I had thought that all
> disk and network I/O was buffered in such a way that at each
> checkpoint, the I/O operations would be released in a burst. Otherwise,
> you would have to synchronize after every I/O operation, which is what
> it seems the current implementation does.

Yes, you're almost right. It synchronizes just before QEMU starts
emulating I/O at each device model. It was originally designed that way
to avoid the complexity of introducing a buffering mechanism, and the
additional I/O latency that buffering would add.

> I'm not sure how that is accomplished atomically though, since you
> could have a completed I/O operation duplicated on the slave node,
> provided it didn't notify completion prior to failure.

That's exactly the point I wanted to discuss. Currently, we call
vm_stop(0), qemu_aio_flush() and bdrv_flush_all() before
qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is
complete. I mimicked what the existing live migration code does. Is
that not enough?

> Is there another kemari component that somehow handles buffering I/O
> that is not obvious from these patches?

No, I'm not hiding anything, and I will share any information regarding
Kemari to develop it in this community :-)

Thanks,

Yoshi

>
> Regards,
>
> Anthony Liguori
>
>