From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1O5IoM-0001IS-Gw for qemu-devel@nongnu.org; Fri, 23 Apr 2010 09:20:50 -0400
Received: from [140.186.70.92] (port=55614 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1O5IoB-0001G5-Jo for qemu-devel@nongnu.org; Fri, 23 Apr 2010 09:20:50 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1O5Io9-00014E-9r for qemu-devel@nongnu.org; Fri, 23 Apr 2010 09:20:39 -0400
Received: from e32.co.us.ibm.com ([32.97.110.150]:55907) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1O5Io9-00013x-3N for qemu-devel@nongnu.org; Fri, 23 Apr 2010 09:20:37 -0400
Received: from d03relay01.boulder.ibm.com (d03relay01.boulder.ibm.com [9.17.195.226]) by e32.co.us.ibm.com (8.14.3/8.13.1) with ESMTP id o3NDDeZm015797 for ; Fri, 23 Apr 2010 07:13:40 -0600
Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay01.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o3NDKNbt096138 for ; Fri, 23 Apr 2010 07:20:26 -0600
Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id o3NDKMRs028543 for ; Fri, 23 Apr 2010 07:20:23 -0600
Message-ID: <4BD19E95.7030906@linux.vnet.ibm.com>
Date: Fri, 23 Apr 2010 08:20:21 -0500
From: Anthony Liguori
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
References: <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp> <4BD00FBA.5040604@redhat.com> <4BD02684.8060202@lab.ntt.co.jp> <4BD03ED8.707@redhat.com> <4BD0B295.3010004@codemonkey.ws> <4BD0FD8F.5060108@lab.ntt.co.jp>
In-Reply-To: <4BD0FD8F.5060108@lab.ntt.co.jp>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Yoshiaki Tamura
Cc: ohmura.kei@lab.ntt.co.jp, mtosatti@redhat.com, kvm@vger.kernel.org, dlaor@redhat.com, qemu-devel@nongnu.org, yoshikawa.takuya@oss.ntt.co.jp, avi@redhat.com

On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
> Anthony Liguori wrote:
>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>> 2010/4/22 Dor Laor:
>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>> Dor Laor wrote:
>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We have been implementing the prototype of Kemari for KVM, and we're sending
>>>>>>> this message to share what we have now and TODO lists. Hopefully, we would like
>>>>>>> to get early feedback to keep us in the right direction. Although advanced
>>>>>>> approaches in the TODO lists are fascinating, we would like to run this project
>>>>>>> step by step while absorbing comments from the community. The current code is
>>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>
>>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>>> following RFC which we posted last year.
>>>>>>>
>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>>
>>>>>>> The transmission/transaction protocol, and most of the control logic is
>>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip from
>>>>>>> proceeding before synchronizing VMs. It may also need some plumbing in the
>>>>>>> kernel side to guarantee replayability of certain events and instructions,
>>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
>>>>>>> as for optimization purposes, for example.
>>>>>> [ snap]
>>>>>>
>>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>>
>>>>>>> === event tapping ===
>>>>>>>
>>>>>>> Event tapping is the core component of Kemari, and it decides on which event the
>>>>>>> primary should synchronize with the secondary. The basic assumption here is
>>>>>>> that outgoing I/O operations are idempotent, which is usually true for disk I/O
>>>>>>> and reliable network protocols such as TCP.
>>>>>> IMO any type of network even should be stalled too. What if the VM runs
>>>>>> non tcp protocol and the packet that the master node sent reached some
>>>>>> remote client and before the sync to the slave the master failed?
>>>>> In current implementation, it is actually stalling any type of network
>>>>> that goes through virtio-net.
>>>>>
>>>>> However, if the application was using unreliable protocols, it should
>>>>> have its own recovering mechanism, or it should be completely stateless.
>>>> Why do you treat tcp differently? You can damage the entire VM this way -
>>>> think of dhcp request that was dropped on the moment you switched between
>>>> the master and the slave?
>>> I'm not trying to say that we should treat tcp differently, but just
>>> it's severe.
>>> In case of dhcp request, the client would have a chance to retry after
>>> failover, correct?
>>> BTW, in current implementation,
>>
>> I'm slightly confused about the current implementation vs. my
>> recollection of the original paper with Xen. I had thought that all disk
>> and network I/O was buffered in such a way that at each checkpoint, the
>> I/O operations would be released in a burst. Otherwise, you would have
>> to synchronize after every I/O operation which is what it seems the
>> current implementation does.
>
> Yes, you're almost right.
> It's synchronizing before QEMU starts emulating I/O at each device model.

If NodeA is the master and NodeB is the slave, and NodeA sends a network
packet, you'll checkpoint before the packet is actually sent. If a failure
then occurs before the next checkpoint, won't both NodeA and NodeB send
the packet, so that a duplicate goes out on the wire?

Regards,

Anthony Liguori
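
To make the duplicate-packet scenario asked about above concrete, here is a toy Python sketch. It is not QEMU or Kemari code: the `Node` class, `checkpoint()`, `run_scenario()` and the `pkt-1` payload are all made up for illustration, and the only "state" that gets checkpointed is a list of pending packets. It assumes the ordering described in the quoted reply, i.e. the sync happens before the device model emulates the I/O.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical VM host; `pending` is I/O the guest requested but not yet emitted."""
    name: str
    pending: list = field(default_factory=list)

    def emit_pending(self, wire):
        # Stand-in for the device model: put every pending packet on the wire.
        while self.pending:
            wire.append((self.name, self.pending.pop(0)))


def checkpoint(primary, secondary):
    # Copy the primary's state (here, just the pending I/O) to the secondary.
    secondary.pending = list(primary.pending)


def run_scenario():
    wire = []                # what the outside world observes
    node_a = Node("NodeA")   # master
    node_b = Node("NodeB")   # slave

    node_a.pending.append("pkt-1")  # guest asks to send a packet
    checkpoint(node_a, node_b)      # sync happens before the I/O is emulated
    node_a.emit_pending(wire)       # NodeA actually sends the packet

    # NodeA fails here, before the next checkpoint. NodeB resumes from the
    # last checkpoint, in which the packet was still pending, and sends it too.
    node_b.emit_pending(wire)
    return wire


if __name__ == "__main__":
    print(run_scenario())
```

In this toy model the wire sees `pkt-1` once from each node. Whether that matters depends on the idempotency assumption quoted earlier: a receiver that tolerates retransmission (as TCP does) will discard the second copy, while other protocols may not.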