Message-ID: <4BD00FBA.5040604@redhat.com>
Date: Thu, 22 Apr 2010 11:58:34 +0300
From: Dor Laor
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
References: <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
In-Reply-To: <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
Reply-To: dlaor@redhat.com
List-Id: qemu-devel.nongnu.org
To: Yoshiaki Tamura
Cc: ohmura.kei@lab.ntt.co.jp, kvm@vger.kernel.org, mtosatti@redhat.com,
 aliguori@us.ibm.com, qemu-devel@nongnu.org, yoshikawa.takuya@oss.ntt.co.jp,
 avi@redhat.com

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> Hi all,
>
> We have been implementing the prototype of Kemari for KVM, and we're sending
> this message to share what we have now and TODO lists. Hopefully, we would like
> to get early feedback to keep us in the right direction. Although advanced
> approaches in the TODO lists are fascinating, we would like to run this project
> step by step while absorbing comments from the community. The current code is
> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>
> For those who are new to Kemari for KVM, please take a look at the
> following RFC which we posted last year.
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>
> The transmission/transaction protocol, and most of the control logic is
> implemented in QEMU. However, we needed a hack in KVM to prevent rip from
> proceeding before synchronizing VMs. It may also need some plumbing in the
> kernel side to guarantee replayability of certain events and instructions,
> integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
> as for optimization purposes, for example.

[ snap]

>
> The rest of this message describes TODO lists grouped by each topic.
>
> === event tapping ===
>
> Event tapping is the core component of Kemari, and it decides on which event the
> primary should synchronize with the secondary. The basic assumption here is
> that outgoing I/O operations are idempotent, which is usually true for disk I/O
> and reliable network protocols such as TCP.

IMO any type of network event should be stalled too. What if the VM runs
a non-TCP protocol, a packet that the master node sent has already reached
some remote client, and the master then fails before the sync to the slave?

[snap]

> === clock ===
>
> Since synchronizing the virtual machines every time the TSC is accessed would be
> prohibitive, the transmission of the TSC will be done lazily, which means
> delaying it until there is a non-TSC synchronization point arrives.

Why do you specifically care about the TSC sync? When you sync the whole
IO model on a snapshot, that also synchronizes the TSC.

In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you would like to do?):

A trivial one would be to:
  - do X online snapshots/sec
  - Stall all IO (disk/block) from the guest to the outside world
    until the previous snapshot reaches the slave.
  - Snapshots are made of
      - diff of dirty pages from last snapshot
      - Qemu device model (+kvm's) diff from last.
You can do 'light' snapshots in between to send dirty pages and so reduce
snapshot time.

I wrote the above to serve as a reference for your comments, so that they
map onto what I have in mind.

Thanks,
dor

>
> TODO:
> - Synchronization of clock sources (need to intercept TSC reads, etc).
>
> === usability ===
>
> These are items that defines how users interact with Kemari.
>
> TODO:
> - Kemarid daemon that takes care of the cluster management/monitoring
>   side of things.
> - Some device emulators might need minor modifications to work well
>   with Kemari. Use white(black)-listing to take the burden of
>   choosing the right device model off the users.
>
> === optimizations ===
>
> Although the big picture can be realized by completing the TODO list above, we
> need some optimizations/enhancements to make Kemari useful in real world, and
> these are items what needs to be done for that.
>
> TODO:
> - SMP (for the sake of performance might need to implement a
>   synchronization protocol that can maintain two or more
>   synchronization points active at any given moment)
> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>   are really dirty).
>
>
> Any comments/suggestions would be greatly appreciated.
>
> Thanks,
>
> Yoshi
>
> --
>
> Kemari starts synchronizing VMs when QEMU handles I/O requests.
> Without this patch VCPU state is already proceeded before
> synchronization, and after failover to the VM on the receiver, it
> hangs because of this.
>
> Signed-off-by: Yoshiaki Tamura
> ---
>  arch/x86/include/asm/kvm_host.h |    1 +
>  arch/x86/kvm/svm.c              |   11 ++++++++---
>  arch/x86/kvm/vmx.c              |   11 ++++++++---
>  arch/x86/kvm/x86.c              |    4 ++++
>  4 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 26c629a..7b8f514 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>      int in;
>      int port;
>      int size;
> +    bool lazy_skip;
>  };
>
>  /*
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d04c7ad..e373245 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>  {
>      struct kvm_vcpu *vcpu = &svm->vcpu;
>      u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
> -    int size, in, string;
> +    int size, in, string, ret;
>      unsigned port;
>
>      ++svm->vcpu.stat.io_exits;
> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>      port = io_info >> 16;
>      size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
>      svm->next_rip = svm->vmcb->control.exit_info_2;
> -    skip_emulated_instruction(&svm->vcpu);
>
> -    return kvm_fast_pio_out(vcpu, size, port);
> +    ret = kvm_fast_pio_out(vcpu, size, port);
> +    if (ret)
> +        skip_emulated_instruction(&svm->vcpu);
> +    else
> +        vcpu->arch.pio.lazy_skip = true;
> +
> +    return ret;
>  }
>
>  static int nmi_interception(struct vcpu_svm *svm)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 41e63bb..09052d6 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu *vcpu)
>  static int handle_io(struct kvm_vcpu *vcpu)
>  {
>      unsigned long exit_qualification;
> -    int size, in, string;
> +    int size, in, string, ret;
>      unsigned port;
>
>      exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>
>      port = exit_qualification >> 16;
>      size = (exit_qualification & 7) + 1;
> -    skip_emulated_instruction(vcpu);
>
> -    return kvm_fast_pio_out(vcpu, size, port);
> +    ret = kvm_fast_pio_out(vcpu, size, port);
> +    if (ret)
> +        skip_emulated_instruction(vcpu);
> +    else
> +        vcpu->arch.pio.lazy_skip = true;
> +
> +    return ret;
>  }
>
>  static void
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fd5c3d3..cc308d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>      if (!irqchip_in_kernel(vcpu->kvm))
>          kvm_set_cr8(vcpu, kvm_run->cr8);
>
> +    if (vcpu->arch.pio.lazy_skip)
> +        kvm_x86_ops->skip_emulated_instruction(vcpu);
> +    vcpu->arch.pio.lazy_skip = false;
> +
>      if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>          vcpu->arch.emulate_ctxt.restart) {
>          if (vcpu->mmio_needed) {
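
For readers following the quoted patch, the deferred-skip idea can be modelled outside the kernel. The following is only a rough sketch, not the KVM code: `toy_vcpu`, `handle_io_exit`, `vcpu_reenter` and the fixed `INSN_LEN` are hypothetical names invented for illustration. It assumes, as the quoted hunks suggest, that a nonzero return from the port-I/O helper means the access completed in the kernel, while zero means the exit goes to userspace (QEMU), which is where Kemari would synchronize.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the vcpu state the patch touches. */
struct toy_vcpu {
    unsigned long rip;  /* guest instruction pointer */
    bool lazy_skip;     /* set when the rip advance is deferred */
};

/* Assumed instruction length; the real code asks the hardware. */
enum { INSN_LEN = 2 };

static void skip_insn(struct toy_vcpu *v)
{
    v->rip += INSN_LEN;
}

/*
 * Models the I/O exit handler. handled_in_kernel plays the role of the
 * port-I/O helper's return value: nonzero means no trip to userspace.
 */
static int handle_io_exit(struct toy_vcpu *v, int handled_in_kernel)
{
    if (handled_in_kernel)
        skip_insn(v);        /* safe to advance immediately */
    else
        v->lazy_skip = true; /* leave rip on the OUT until re-entry */
    return handled_in_kernel;
}

/* Models the start of the next run request from userspace. */
static void vcpu_reenter(struct toy_vcpu *v)
{
    if (v->lazy_skip)
        skip_insn(v);
    v->lazy_skip = false;
}
```

In this model, a state snapshot taken between `handle_io_exit(&v, 0)` and `vcpu_reenter(&v)` still has `rip` pointing at the I/O instruction, so a secondary restored from that snapshot would re-execute the access rather than resume past it.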