From: Dor Laor <dlaor@redhat.com>
To: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Cc: ohmura.kei@lab.ntt.co.jp, kvm@vger.kernel.org,
mtosatti@redhat.com, aliguori@us.ibm.com, qemu-devel@nongnu.org,
yoshikawa.takuya@oss.ntt.co.jp, avi@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Date: Thu, 22 Apr 2010 15:19:36 +0300
Message-ID: <4BD03ED8.707@redhat.com>
In-Reply-To: <4BD02684.8060202@lab.ntt.co.jp>
On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
> Dor Laor wrote:
>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>> Hi all,
>>>
>>> We have been implementing the prototype of Kemari for KVM, and we're
>>> sending this message to share what we have now and TODO lists.
>>> Hopefully, we would like to get early feedback to keep us in the right
>>> direction. Although advanced approaches in the TODO lists are
>>> fascinating, we would like to run this project step by step while
>>> absorbing comments from the community. The current code is based on
>>> qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>
>>> For those who are new to Kemari for KVM, please take a look at the
>>> following RFC which we posted last year.
>>>
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>
>>> The transmission/transaction protocol and most of the control logic are
>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>> from proceeding before synchronizing VMs. It may also need some plumbing
>>> on the kernel side to guarantee replayability of certain events and
>>> instructions, to integrate the RAS capabilities of newer x86 hardware
>>> with the HA stack, and for optimization purposes, for example.
>>
>> [ snap]
>>
>>>
>>> The rest of this message describes TODO lists grouped by each topic.
>>>
>>> === event tapping ===
>>>
>>> Event tapping is the core component of Kemari, and it decides on which
>>> event the primary should synchronize with the secondary. The basic
>>> assumption here is that outgoing I/O operations are idempotent, which is
>>> usually true for disk I/O and reliable network protocols such as TCP.
>>
>> IMO any type of network event should be stalled too. What if the VM runs
>> a non-TCP protocol, and a packet that the master node sent reaches some
>> remote client, and the master fails before the sync to the slave?
>
> In the current implementation, it actually stalls any type of network
> traffic that goes through virtio-net.
>
> However, if the application is using unreliable protocols, it should
> have its own recovery mechanism, or it should be completely stateless.
Why do you treat TCP differently? You can damage the entire VM this way:
think of a DHCP request that was dropped at the moment you switched
between the master and the slave.
>
>> [snap]
>>
>>
>>> === clock ===
>>>
>>> Since synchronizing the virtual machines every time the TSC is accessed
>>> would be prohibitive, the transmission of the TSC will be done lazily,
>>> which means delaying it until a non-TSC synchronization point arrives.
>>
>> Why do you specifically care about the tsc sync? When you sync all the
>> IO model on snapshot it also synchronizes the tsc.
So, do you agree that an extra clock synchronization is not needed since
it is done anyway as part of the live migration state sync?
>>
>> In general, can you please explain the 'algorithm' for continuous
>> snapshots (is that what you'd like to do?):
>
> Yes, of course.
> Sorry for being less informative.
>
>> A trivial one would be to:
>> - do X online snapshots/sec
>
> I don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running. If the
> guest is almost idle, there may be no snapshots in 5 sec. On the other
> hand, if the guest is running I/O intensive workloads (netperf, iozone
> for example), there will be about 50 snapshots/sec.
>
>> - Stall all IO (disk/block) from the guest to the outside world
>> until the previous snapshot reaches the slave.
>
> Yes, it does.
>
>> - Snapshots are made of
>
> Full device model + diff of dirty pages from the last snapshot.
>
>> - diff of dirty pages from last snapshot
>
> This also depends on the workload.
> In case of I/O intensive workloads, dirty pages are usually less than 100.
The hardest would be memory-intensive loads.
So 100 snap/sec means a latency of 10 msec, right?
(Not that it's not OK; with faster hardware and IB you'll be able to get
much more.)
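
To spell out the arithmetic: the average gap between snapshots is the
inverse of the snapshot rate, so the ~50 snapshots/sec you mention is
about 20 msec, and 100/sec would be 10 msec. A minimal sketch in C (the
helper name is made up for illustration, and it only gives the spacing
between snapshots, not how long guest I/O is actually held back):

```c
#include <assert.h>

/* Hypothetical helper: average time between snapshots, in milliseconds. */
static double snapshot_interval_ms(double snapshots_per_sec)
{
    /* A rate of zero (idle guest, no snapshots) has no finite interval. */
    assert(snapshots_per_sec > 0.0);
    return 1000.0 / snapshots_per_sec;
}
```
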
>
>> - Qemu device model (+kvm's) diff from last.
>
> We're currently sending full copy because we're completely reusing this
> part of existing live migration framework.
>
> Last time we measured, it was about 13KB.
> But it varies by which QEMU version is used.
>
>> You can do 'light' snapshots in between to send dirty pages to reduce
>> snapshot time.
>
> I agree. That's one of the advanced topics we would like to try too.
>
>> I wrote the above to serve as a reference for your comments so it will
>> map into my mind. Thanks, dor
>
> Thank you for the guidance.
> I hope this answers your question.
>
> At the same time, I would also be happy if we could discuss how to
> implement it too. In fact, we needed a hack to prevent rip from
> proceeding in KVM, which turned out not to be the best workaround.
There are brute-force solutions like:
- Stop the guest until you have sent all of the snapshot to the remote
  (like standard live migration).
- Stop + fork + cont the parent.
Or mark the recent dirty pages that were not yet sent to the remote as
write-protected, and copy them if touched.
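
A rough userspace sketch of that last option, just to illustrate the
idea. Everything here is hypothetical: the struct and function names, the
fixed page count, and the explicit guest_write() call that stands in for
a real write-protection fault. It is not how KVM or QEMU track dirty
pages:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    4

struct page_state {
    bool pending;  /* dirty at snapshot time, not yet sent to the remote */
    bool copied;   /* guest wrote again; snapshot data lives in shadow */
};

struct vm_mem {
    uint8_t live[NPAGES][PAGE_SIZE];    /* memory the guest keeps using */
    uint8_t shadow[NPAGES][PAGE_SIZE];  /* copies made on first write */
    struct page_state st[NPAGES];
};

/* Start of a snapshot: "write protect" every page that is dirty. */
static void snapshot_begin(struct vm_mem *m, const bool dirty[NPAGES])
{
    for (int i = 0; i < NPAGES; i++) {
        m->st[i].pending = dirty[i];
        m->st[i].copied = false;
    }
}

/* Guest write path: stands in for the write-protection fault handler. */
static void guest_write(struct vm_mem *m, int pfn, size_t off, uint8_t val)
{
    assert(pfn >= 0 && pfn < NPAGES && off < PAGE_SIZE);
    if (m->st[pfn].pending && !m->st[pfn].copied) {
        /* First touch since the snapshot: preserve the old contents. */
        memcpy(m->shadow[pfn], m->live[pfn], PAGE_SIZE);
        m->st[pfn].copied = true;
    }
    m->live[pfn][off] = val;
}

/* Sender: return the snapshot-time contents and drop the protection. */
static const uint8_t *snapshot_page(struct vm_mem *m, int pfn)
{
    assert(pfn >= 0 && pfn < NPAGES && m->st[pfn].pending);
    m->st[pfn].pending = false;
    return m->st[pfn].copied ? m->shadow[pfn] : m->live[pfn];
}
```

After snapshot_begin() the guest can keep running. A page that gets
written before it is sent is copied once, and the sender still sees the
snapshot-time contents.
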
>
> Thanks,
>
> Yoshi
>
>>
>>>
>>> TODO:
>>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>>
>>> === usability ===
>>>
>>> These are items that define how users interact with Kemari.
>>>
>>> TODO:
>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>> side of things.
>>> - Some device emulators might need minor modifications to work well
>>> with Kemari. Use white(black)-listing to take the burden of
>>> choosing the right device model off the users.
>>>
>>> === optimizations ===
>>>
>>> Although the big picture can be realized by completing the TODO list
>>> above, we need some optimizations/enhancements to make Kemari useful in
>>> the real world, and these are the items that need to be done for that.
>>>
>>> TODO:
>>> - SMP (for the sake of performance might need to implement a
>>> synchronization protocol that can maintain two or more
>>> synchronization points active at any given moment)
>>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>>> are really dirty).
>>>
>>>
>>> Any comments/suggestions would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Yoshi
>>>
>>> --
>>>
>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>> Without this patch, the VCPU state has already advanced before
>>> synchronization, and after failover to the VM on the receiver, it
>>> hangs because of this.
>>>
>>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 1 +
>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>> arch/x86/kvm/x86.c | 4 ++++
>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 26c629a..7b8f514 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>> int in;
>>> int port;
>>> int size;
>>> + bool lazy_skip;
>>> };
>>>
>>> /*
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index d04c7ad..e373245 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>>> {
>>> struct kvm_vcpu *vcpu = &svm->vcpu;
>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>> - int size, in, string;
>>> + int size, in, string, ret;
>>> unsigned port;
>>>
>>> ++svm->vcpu.stat.io_exits;
>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>>> port = io_info >> 16;
>>> size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>> - skip_emulated_instruction(&svm->vcpu);
>>>
>>> - return kvm_fast_pio_out(vcpu, size, port);
>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>> + if (ret)
>>> + skip_emulated_instruction(&svm->vcpu);
>>> + else
>>> + vcpu->arch.pio.lazy_skip = true;
>>> +
>>> + return ret;
>>> }
>>>
>>> static int nmi_interception(struct vcpu_svm *svm)
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 41e63bb..09052d6 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>> *vcpu)
>>> static int handle_io(struct kvm_vcpu *vcpu)
>>> {
>>> unsigned long exit_qualification;
>>> - int size, in, string;
>>> + int size, in, string, ret;
>>> unsigned port;
>>>
>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>
>>> port = exit_qualification >> 16;
>>> size = (exit_qualification & 7) + 1;
>>> - skip_emulated_instruction(vcpu);
>>>
>>> - return kvm_fast_pio_out(vcpu, size, port);
>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>> + if (ret)
>>> + skip_emulated_instruction(vcpu);
>>> + else
>>> + vcpu->arch.pio.lazy_skip = true;
>>> +
>>> + return ret;
>>> }
>>>
>>> static void
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index fd5c3d3..cc308d2 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>> *vcpu, struct kvm_run *kvm_run)
>>> if (!irqchip_in_kernel(vcpu->kvm))
>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>
>>> + if (vcpu->arch.pio.lazy_skip)
>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>> + vcpu->arch.pio.lazy_skip = false;
>>> +
>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>> vcpu->arch.emulate_ctxt.restart) {
>>> if (vcpu->mmio_needed) {
>>
>>
>>
>>
>
>
>