From: Yoshiaki Tamura
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Date: Mon, 26 Apr 2010 19:44:11 +0900
Message-ID: <4BD56E7B.30906@lab.ntt.co.jp>
In-Reply-To: <4BD19E95.7030906@linux.vnet.ibm.com>
References: <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
 <4BD00FBA.5040604@redhat.com> <4BD02684.8060202@lab.ntt.co.jp>
 <4BD03ED8.707@redhat.com> <4BD0B295.3010004@codemonkey.ws>
 <4BD0FD8F.5060108@lab.ntt.co.jp> <4BD19E95.7030906@linux.vnet.ibm.com>
To: Anthony Liguori
Cc: dlaor@redhat.com, ohmura.kei@lab.ntt.co.jp, kvm@vger.kernel.org,
 mtosatti@redhat.com, qemu-devel@nongnu.org,
 yoshikawa.takuya@oss.ntt.co.jp, avi@redhat.com

Anthony Liguori wrote:
> On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>>> 2010/4/22 Dor Laor:
>>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>>> Dor Laor wrote:
>>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have been implementing the prototype of Kemari for KVM, and
>>>>>>>> we're sending this message to share what we have now and TODO
>>>>>>>> lists. Hopefully, we would like to get early feedback to keep
>>>>>>>> us in the right direction. Although advanced approaches in the
>>>>>>>> TODO lists are fascinating, we would like to run this project
>>>>>>>> step by step while absorbing comments from the community.
>>>>>>>> The current code is based on qemu-kvm.git
>>>>>>>> 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>>
>>>>>>>> For those who are new to Kemari for KVM, please take a look at
>>>>>>>> the following RFC which we posted last year.
>>>>>>>>
>>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>>>
>>>>>>>> The transmission/transaction protocol, and most of the control
>>>>>>>> logic is implemented in QEMU. However, we needed a hack in KVM
>>>>>>>> to prevent rip from proceeding before synchronizing VMs. It may
>>>>>>>> also need some plumbing in the kernel side to guarantee
>>>>>>>> replayability of certain events and instructions, integrate the
>>>>>>>> RAS capabilities of newer x86 hardware with the HA stack, as
>>>>>>>> well as for optimization purposes, for example.
>>>>>>> [ snap]
>>>>>>>
>>>>>>>> The rest of this message describes TODO lists grouped by each
>>>>>>>> topic.
>>>>>>>>
>>>>>>>> === event tapping ===
>>>>>>>>
>>>>>>>> Event tapping is the core component of Kemari, and it decides
>>>>>>>> on which event the primary should synchronize with the
>>>>>>>> secondary. The basic assumption here is that outgoing I/O
>>>>>>>> operations are idempotent, which is usually true for disk I/O
>>>>>>>> and reliable network protocols such as TCP.
>>>>>>> IMO any type of network even should be stalled too. What if the
>>>>>>> VM runs non tcp protocol and the packet that the master node
>>>>>>> sent reached some remote client and before the sync to the slave
>>>>>>> the master failed?
>>>>>> In current implementation, it is actually stalling any type of
>>>>>> network that goes through virtio-net.
>>>>>>
>>>>>> However, if the application was using unreliable protocols, it
>>>>>> should have its own recovering mechanism, or it should be
>>>>>> completely stateless.
>>>>> Why do you treat tcp differently? You can damage the entire VM
>>>>> this way - think of dhcp request that was dropped on the moment
>>>>> you switched between the master and the slave?
>>>> I'm not trying to say that we should treat tcp differently, but
>>>> just it's severe.
>>>> In case of dhcp request, the client would have a chance to retry
>>>> after failover, correct?
>>>> BTW, in current implementation,
>>>
>>> I'm slightly confused about the current implementation vs. my
>>> recollection of the original paper with Xen. I had thought that all
>>> disk and network I/O was buffered in such a way that at each
>>> checkpoint, the I/O operations would be released in a burst.
>>> Otherwise, you would have to synchronize after every I/O operation
>>> which is what it seems the current implementation does.
>>
>> Yes, you're almost right.
>> It's synchronizing before QEMU starts emulating I/O at each device
>> model.
>
> If NodeA is the master and NodeB is the slave, if NodeA sends a
> network packet, you'll checkpoint before the packet is actually sent,
> and then if a failure occurs before the next checkpoint, won't that
> result in both NodeA and NodeB sending out a duplicate version of the
> packet?

Yes. But I think it's better than taking the checkpoint afterwards.

Suppose we checkpoint after sending the packet, and the packet is a TCP
ACK to the client. If a hardware failure occurs on NodeA during the
transaction *but the client has received the TCP ACK*, NodeB will resume
from the previous state, in which it still needs to receive that data
from the client. However, because the client has already received the
TCP ACK, it won't resend the data to NodeB. It looks like this data is
going to be dropped.

Anyway, I've just started planning to move the sync point to the
network/block layer, and I will post the result for discussion again.
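
To make the two orderings above easier to compare, here is a toy Python
sketch. It is not Kemari or QEMU code: the function `simulate`, the
dictionary keys `received` and `pending_tx`, and the `wire` list are all
hypothetical names made up for this illustration. It assumes NodeA has
taken in one segment from the client, has an ACK queued for it, and fails
right after the first of the two steps (sync, send).

```python
import copy


def simulate(checkpoint_before_send):
    """Fail NodeA mid-transaction and fail over to NodeB.

    Returns (packets the client observed, whether acknowledged data is lost).
    """
    # NodeA has taken in "seg1" since the last checkpoint and queued an ACK.
    primary = {"received": ["seg1"], "pending_tx": ["ACK(seg1)"]}
    # NodeB still holds the previous checkpoint: no data, nothing queued.
    secondary = {"received": [], "pending_tx": []}
    wire = []  # packets that actually reached the client

    if checkpoint_before_send:
        secondary = copy.deepcopy(primary)         # sync first
        wire.append(primary["pending_tx"].pop(0))  # then send the ACK
        # NodeA fails here, before the next checkpoint.
    else:
        wire.append(primary["pending_tx"].pop(0))  # send the ACK first
        # NodeA fails here, before the sync ever happens.

    # Failover: NodeB resumes from its last checkpoint and emits whatever
    # output was still queued in that state.
    wire.extend(secondary["pending_tx"])

    client_saw_ack = "ACK(seg1)" in wire
    data_lost = client_saw_ack and "seg1" not in secondary["received"]
    return wire, data_lost


if __name__ == "__main__":
    print("checkpoint before send:", simulate(True))
    print("checkpoint after send: ", simulate(False))
```

In this model, checkpointing before the send leaves the client with a
duplicate ACK, which TCP tolerates, while checkpointing after the send
leaves the client holding an ACK for data that NodeB never saw.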