From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yoshiaki Tamura
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Date: Fri, 23 Apr 2010 10:53:19 +0900
Message-ID: <4BD0FD8F.5060108@lab.ntt.co.jp>
References: <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
 <4BD00FBA.5040604@redhat.com> <4BD02684.8060202@lab.ntt.co.jp>
 <4BD03ED8.707@redhat.com> <4BD0B295.3010004@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: dlaor@redhat.com, ohmura.kei@lab.ntt.co.jp, kvm@vger.kernel.org,
 mtosatti@redhat.com, aliguori@us.ibm.com, qemu-devel@nongnu.org,
 yoshikawa.takuya@oss.ntt.co.jp, avi@redhat.com
To: Anthony Liguori
Return-path: 
Received: from tama50.ecl.ntt.co.jp ([129.60.39.147]:60353 "EHLO
 tama50.ecl.ntt.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with
 ESMTP id S1752760Ab0DWBxm (ORCPT ); Thu, 22 Apr 2010 21:53:42 -0400
In-Reply-To: <4BD0B295.3010004@codemonkey.ws>
Sender: kvm-owner@vger.kernel.org
List-ID: 

Anthony Liguori wrote:
> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>> 2010/4/22 Dor Laor:
>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>> Dor Laor wrote:
>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>>> sending this message to share what we have now and our TODO lists.
>>>>>> We hope to get early feedback to keep us headed in the right
>>>>>> direction. Although the advanced approaches in the TODO lists are
>>>>>> fascinating, we would like to run this project step by step while
>>>>>> absorbing comments from the community. The current code is based on
>>>>>> qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>
>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>> following RFC, which we posted last year.
>>>>>>
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>
>>>>>> The transmission/transaction protocol and most of the control logic
>>>>>> are implemented in QEMU. However, we needed a hack in KVM to prevent
>>>>>> rip from proceeding before synchronizing the VMs. It may also need
>>>>>> some plumbing on the kernel side to guarantee replayability of
>>>>>> certain events and instructions, to integrate the RAS capabilities
>>>>>> of newer x86 hardware with the HA stack, and for optimization
>>>>>> purposes, for example.
>>>>> [snip]
>>>>>
>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>
>>>>>> === event tapping ===
>>>>>>
>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>> which event the primary should synchronize with the secondary. The
>>>>>> basic assumption here is that outgoing I/O operations are
>>>>>> idempotent, which is usually true for disk I/O and reliable network
>>>>>> protocols such as TCP.
>>>>> IMO any type of network event should be stalled too. What if the VM
>>>>> runs a non-TCP protocol, and a packet the master node sent reached
>>>>> some remote client, but the master failed before syncing to the
>>>>> slave?
>>>> The current implementation actually stalls any type of network
>>>> traffic that goes through virtio-net.
>>>>
>>>> However, if the application is using an unreliable protocol, it
>>>> should have its own recovery mechanism, or it should be completely
>>>> stateless.
>>> Why do you treat TCP differently? You can damage the entire VM this
>>> way - think of a DHCP request that was dropped at the moment you
>>> switched between the master and the slave.
>> I'm not trying to say that we should treat TCP differently, just that
>> it's more severe. In the case of a DHCP request, the client would have
>> a chance to retry after failover, correct?
>> BTW, in current implementation,

> I'm slightly confused about the current implementation vs. my
> recollection of the original paper with Xen. I had thought that all
> disk and network I/O was buffered in such a way that at each
> checkpoint, the I/O operations would be released in a burst. Otherwise,
> you would have to synchronize after every I/O operation, which is what
> it seems the current implementation does.

Yes, you're almost right. It synchronizes just before QEMU starts
emulating I/O at each device model. It was originally designed that way
to avoid the complexity of introducing a buffering mechanism, and the
additional I/O latency that buffering would add.

> I'm not sure how that is accomplished atomically though, since you
> could have a completed I/O operation duplicated on the slave node,
> provided it didn't notify completion prior to failure.

That's exactly the point I wanted to discuss. Currently, we call
vm_stop(0), qemu_aio_flush() and bdrv_flush_all() before
qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is
complete. I mimicked what the existing live migration code does. Is
that not enough?

> Is there another kemari component that somehow handles buffering I/O
> that is not obvious from these patches?

No, I'm not hiding anything, and I will share any information regarding
Kemari to develop it in this community :-)

Thanks,

Yoshi

>
> Regards,
>
> Anthony Liguori
>
>