From: Chegu Vinod <chegu_vinod@hp.com>
To: Orit Wasserman <owasserm@redhat.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org,
anthony@codemonkey.ws, quintela@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
Date: Tue, 30 Apr 2013 10:51:03 -0700 [thread overview]
Message-ID: <51800487.4020507@hp.com> (raw)
In-Reply-To: <517FDD8D.6090002@redhat.com>
On 4/30/2013 8:04 AM, Orit Wasserman wrote:
> On 04/27/2013 11:50 PM, Chegu Vinod wrote:
>> Busy enterprise workloads hosted on large sized VM's tend to dirty
>> memory faster than the transfer rate achieved via live guest migration.
>> Despite some good recent improvements (& using dedicated 10Gig NICs
>> between hosts) the live migration does NOT converge.
>>
>> A few options that were discussed/being-pursued to help with
>> the convergence issue include:
>>
>> 1) Slow down guest considerably via cgroup's CPU controls - requires
>> libvirt client support to detect & trigger action, but conceptually
>> similar to this RFC change.
>>
>> 2) Speed up transfer rate:
>> - RDMA based Pre-copy - lower overhead and fast (Unfortunately
>> has a few restrictions and some customers still choose not
>> to deploy RDMA :-( ).
>> - Add parallelism to improve transfer rate and use multiple 10Gig
>> connections (bonded). - could add some overhead on the host.
>>
>> 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
>> promising but need to consider & handle newer failure scenarios.
>>
>> If an enterprise user chooses to force convergence of their migration
>> via the new capability "auto-converge" then with this change we auto-detect
>> lack of convergence scenario and trigger a slow down of the workload
>> by explicitly disallowing the VCPUs from spending much time in the VM
>> context.
>>
>> The migration thread tries to catchup and this eventually leads
>> to convergence in some "deterministic" amount of time. Yes it does
>> impact the performance of all the VCPUs but in my observation that
>> lasts only for a short duration of time. i.e. we end up entering
>> stage 3 (downtime phase) soon after that.
>>
>> No exernal trigger is required (unlike option 1) and it can co-exist
>> with enhancements being pursued as part of Option 2 (e.g. RDMA).
>>
>> Thanks to Juan and Paolo for their useful suggestions.
>>
>> Verified the convergence using the following :
>> - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>> - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>
>> Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>> migrate downtime set to 4seconds).
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: off <----
>> Migration status: active
>> total time: 1487503 milliseconds
>> expected downtime: 519 milliseconds
>> transferred ram: 383749347 kbytes
>> remaining ram: 2753372 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 65461532 pages
>> skipped: 64901568 pages
>> normal: 95750218 pages
>> normal bytes: 383000872 kbytes
>> dirty pages rate: 67551 pages
>>
>> ---
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: on <----
>> Migration status: completed
>> total time: 241161 milliseconds
>> downtime: 6373 milliseconds
>> transferred ram: 28235307 kbytes
>> remaining ram: 0 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 64946416 pages
>> skipped: 64903523 pages
>> normal: 7044971 pages
>> normal bytes: 28179884 kbytes
>>
>> Changes from v1:
>> - rebased to latest qemu.git
>> - added auto-converge capability(default off) - suggested by Anthony Liguori &
>> Eric Blake.
>>
>> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
>> ---
>> arch_init.c | 44 +++++++++++++++++++++++++++++++++++
>> cpus.c | 12 +++++++++
>> include/migration/migration.h | 12 +++++++++
>> include/qemu/main-loop.h | 3 ++
>> kvm-all.c | 51 +++++++++++++++++++++++++++++++++++++++++
>> migration.c | 15 ++++++++++++
>> qapi-schema.json | 6 ++++-
>> 7 files changed, 142 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 92de1bd..6dcc742 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -104,6 +104,7 @@ int graphic_depth = 15;
>> #endif
>>
>> const uint32_t arch_type = QEMU_ARCH;
>> +static uint64_t mig_throttle_on;
>>
>> /***********************************************************/
>> /* ram save/restore */
>> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
>> MigrationState *s = migrate_get_current();
>> static int64_t start_time;
>> static int64_t num_dirty_pages_period;
>> + static int64_t bytes_xfer_prev;
>> int64_t end_time;
>> + int64_t bytes_xfer_now;
>> + static int dirty_rate_high_cnt;
>> +
>> + if (migrate_auto_converge() && !bytes_xfer_prev) {
>> + bytes_xfer_prev = ram_bytes_transferred();
>> + }
>>
>> if (!start_time) {
>> start_time = qemu_get_clock_ms(rt_clock);
>> }
>>
>> +
>> trace_migration_bitmap_sync_start();
>> memory_global_sync_dirty_bitmap(get_system_memory());
>>
>> @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
>>
>> /* more than 1 second = 1000 millisecons */
>> if (end_time > start_time + 1000) {
>> + if (migrate_auto_converge()) {
>> + /* The following detection logic can be refined later. For now:
>> + Check to see if the dirtied bytes is 50% more than the approx.
>> + amount of bytes that just got transferred since the last time we
>> + were in this routine. If that happens N times (for now N==5)
>> + we turn on the throttle down logic */
>> + bytes_xfer_now = ram_bytes_transferred();
>> + if (s->dirty_pages_rate &&
>> + ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
>> + ((bytes_xfer_now - bytes_xfer_prev)/2))) {
>> + if (dirty_rate_high_cnt++ > 5) {
>> + DPRINTF("Unable to converge. Throtting down guest\n");
>> + mig_throttle_on = 1;
>> + }
> Why not check to see if mig_throttle_on is already on?
Once its set to 1 it shouldn't really matter setting it again to 1. Not
sure I understand the need for adding the check.
>
>> + }
>> + bytes_xfer_prev = bytes_xfer_now;
>> + }
>> s->dirty_pages_rate = num_dirty_pages_period * 1000
>> / (end_time - start_time);
>> s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
>> @@ -496,6 +522,24 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>> return bytes_sent;
>> }
>>
>> +bool throttling_needed(void)
>> +{
>> + bool value;
>> +
>> + if (!migrate_auto_converge()) {
>> + return false;
>> + }
>> +
>> + qemu_mutex_lock_mig_throttle();
>> + value = mig_throttle_on;
>> + qemu_mutex_unlock_mig_throttle();
>> +
>> + if (value) {
>> + return true;
>> + }
>> + return false;
>> +}
>> +
> Why not return value here ?
yes
> Cheers,
> Orit
>
>> static uint64_t bytes_transferred;
>>
>> static ram_addr_t ram_save_remaining(void)
>> diff --git a/cpus.c b/cpus.c
>> index 5a98a37..615c25a 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -616,6 +616,7 @@ static void qemu_tcg_init_cpu_signals(void)
>> #endif /* _WIN32 */
>>
>> static QemuMutex qemu_global_mutex;
>> +static QemuMutex qemu_mig_throttle_mutex;
>> static QemuCond qemu_io_proceeded_cond;
>> static bool iothread_requesting_mutex;
>>
>> @@ -638,6 +639,7 @@ void qemu_init_cpu_loop(void)
>> qemu_cond_init(&qemu_work_cond);
>> qemu_cond_init(&qemu_io_proceeded_cond);
>> qemu_mutex_init(&qemu_global_mutex);
>> + qemu_mutex_init(&qemu_mig_throttle_mutex);
>>
>> qemu_thread_get_self(&io_thread);
>> }
>> @@ -943,6 +945,16 @@ void qemu_mutex_unlock_iothread(void)
>> qemu_mutex_unlock(&qemu_global_mutex);
>> }
>>
>> +void qemu_mutex_lock_mig_throttle(void)
>> +{
>> + qemu_mutex_lock(&qemu_mig_throttle_mutex);
>> +}
>> +
>> +void qemu_mutex_unlock_mig_throttle(void)
>> +{
>> + qemu_mutex_unlock(&qemu_mig_throttle_mutex);
>> +}
>> +
>> static int all_vcpus_paused(void)
>> {
>> CPUArchState *penv = first_cpu;
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index e2acec6..94bdb8c 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -127,4 +127,16 @@ int migrate_use_xbzrle(void);
>> int64_t migrate_xbzrle_cache_size(void);
>>
>> int64_t xbzrle_cache_resize(int64_t new_size);
>> +
>> +#ifndef _QEMU_MIG_THROTTLE
>> +#define _QEMU_MIG_THROTTLE
> Do you need those defines?
Not needed ...will remove them.
>
>> +
>> +bool migrate_auto_converge(void);
>> +
>> +bool throttling_needed(void);
>> +bool throttling_now(void);
>> +void *migration_throttle_down(void *);
>> +
>> +#endif
>> +
>> #endif
>> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
>> index 6f0200a..9a3886d 100644
>> --- a/include/qemu/main-loop.h
>> +++ b/include/qemu/main-loop.h
>> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>> */
>> void qemu_mutex_unlock_iothread(void);
>>
>> +void qemu_mutex_lock_mig_throttle(void);
>> +void qemu_mutex_unlock_mig_throttle(void);
>> +
>> /* internal interfaces */
>>
>> void qemu_fd_register(int fd);
>> diff --git a/kvm-all.c b/kvm-all.c
>> index 2d92721..a92cb77 100644
>> --- a/kvm-all.c
>> +++ b/kvm-all.c
>> @@ -33,6 +33,8 @@
>> #include "exec/memory.h"
>> #include "exec/address-spaces.h"
>> #include "qemu/event_notifier.h"
>> +#include "sysemu/cpus.h"
>> +#include "migration/migration.h"
>>
>> /* This check must be after config-host.h is included */
>> #ifdef CONFIG_EVENTFD
>> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>> KVM_CAP_LAST_INFO
>> };
>>
>> +static void mig_delay_vcpu(void);
>> +
>> static KVMSlot *kvm_alloc_slot(KVMState *s)
>> {
>> int i;
>> @@ -1609,6 +1613,10 @@ int kvm_cpu_exec(CPUArchState *env)
>> }
>> qemu_mutex_unlock_iothread();
>>
>> + if (throttling_needed()) {
>> + mig_delay_vcpu();
>> + }
>> +
>> run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
>>
>> qemu_mutex_lock_iothread();
>> @@ -2032,3 +2040,46 @@ int kvm_on_sigbus(int code, void *addr)
>> {
>> return kvm_arch_on_sigbus(code, addr);
>> }
>> +
>> +static bool throttling;
>> +bool throttling_now(void)
>> +{
>> + if (throttling) {
>> + return true;
>> + }
>> + return false;
>> +}
>> +
> it will be simpler to just return throttling ?
yes. will fix it.
Thanks for your comments
Vinod
>
>> +static void mig_delay_vcpu(void)
>> +{
>> + g_usleep(50*1000);
>> +}
>> +
>> +/* Stub used for getting the vcpu out of VM and into qemu via
>> + run_on_cpu()*/
>> +static void mig_kick_cpu(void *opq)
>> +{
>> + return;
>> +}
>> +
>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>> + much time in the VM. The migration thread will try to catchup.
>> + Workload will experience a greater performance drop but for a shorter
>> + duration.
>> +*/
>> +void *migration_throttle_down(void *opaque)
>> +{
>> + throttling = true;
>> + while (throttling_needed()) {
>> + CPUArchState *penv = first_cpu;
>> + while (penv) {
>> + qemu_mutex_lock_iothread();
>> + run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>> + qemu_mutex_unlock_iothread();
>> + penv = penv->next_cpu;
>> + }
>> + g_usleep(25*1000);
>> + }
>> + throttling = false;
>> + return NULL;
>> +}
>> diff --git a/migration.c b/migration.c
>> index 3eb0fad..834156e 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -24,6 +24,7 @@
>> #include "qemu/thread.h"
>> #include "qmp-commands.h"
>> #include "trace.h"
>> +#include "sysemu/cpus.h"
>>
>> //#define DEBUG_MIGRATION
>>
>> @@ -474,6 +475,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>> max_downtime = (uint64_t)value;
>> }
>>
>> +bool migrate_auto_converge(void)
>> +{
>> + MigrationState *s;
>> +
>> + s = migrate_get_current();
>> +
>> + return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
>> +}
>> +
>> int migrate_use_xbzrle(void)
>> {
>> MigrationState *s;
>> @@ -503,6 +513,7 @@ static void *migration_thread(void *opaque)
>> int64_t max_size = 0;
>> int64_t start_time = initial_time;
>> bool old_vm_running = false;
>> + QemuThread thread;
>>
>> DPRINTF("beginning savevm\n");
>> qemu_savevm_state_begin(s->file, &s->params);
>> @@ -517,6 +528,10 @@ static void *migration_thread(void *opaque)
>> DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>> if (pending_size && pending_size >= max_size) {
>> qemu_savevm_state_iterate(s->file);
>> + if (throttling_needed() && !throttling_now()) {
>> + qemu_thread_create(&thread, migration_throttle_down,
>> + NULL, QEMU_THREAD_DETACHED);
>> + }
>> } else {
>> DPRINTF("done iterating\n");
>> qemu_mutex_lock_iothread();
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 5b0fb3b..b662e33 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -599,10 +599,14 @@
>> # This feature allows us to minimize migration traffic for certain work
>> # loads, by sending compressed difference of the pages
>> #
>> +# @auto-converge: Controls whether or not the we want the migration to
>> +# automaticially detect and force convergence by slowing
>> +# down the guest. Disabled by default.
>> +#
>> # Since: 1.2
>> ##
>> { 'enum': 'MigrationCapability',
>> - 'data': ['xbzrle'] }
>> + 'data': ['xbzrle', 'auto-converge'] }
>>
>> ##
>> # @MigrationCapabilityStatus
>> .
>>
next prev parent reply other threads:[~2013-04-30 17:51 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-04-27 20:50 [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge Chegu Vinod
2013-04-29 14:53 ` Eric Blake
2013-04-29 17:48 ` Chegu Vinod
2013-04-30 15:04 ` Orit Wasserman
2013-04-30 17:51 ` Chegu Vinod [this message]
2013-04-30 15:20 ` Juan Quintela
2013-04-30 15:55 ` Chegu Vinod
2013-04-30 16:01 ` Juan Quintela
2013-04-30 17:48 ` Chegu Vinod
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=51800487.4020507@hp.com \
--to=chegu_vinod@hp.com \
--cc=anthony@codemonkey.ws \
--cc=owasserm@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).