From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:41954) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UXEhy-0007cI-Th for qemu-devel@nongnu.org; Tue, 30 Apr 2013 13:51:24 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UXEhl-0005bj-9X for qemu-devel@nongnu.org; Tue, 30 Apr 2013 13:51:18 -0400 Received: from g1t0028.austin.hp.com ([15.216.28.35]:4739) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UXEhk-0005b2-VP for qemu-devel@nongnu.org; Tue, 30 Apr 2013 13:51:05 -0400 Message-ID: <51800487.4020507@hp.com> Date: Tue, 30 Apr 2013 10:51:03 -0700 From: Chegu Vinod MIME-Version: 1.0 References: <1367095836-19318-1-git-send-email-chegu_vinod@hp.com> <517FDD8D.6090002@redhat.com> In-Reply-To: <517FDD8D.6090002@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge. List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Orit Wasserman Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, anthony@codemonkey.ws, quintela@redhat.com On 4/30/2013 8:04 AM, Orit Wasserman wrote: > On 04/27/2013 11:50 PM, Chegu Vinod wrote: >> Busy enterprise workloads hosted on large sized VM's tend to dirty >> memory faster than the transfer rate achieved via live guest migration. >> Despite some good recent improvements (& using dedicated 10Gig NICs >> between hosts) the live migration does NOT converge. >> >> A few options that were discussed/being-pursued to help with >> the convergence issue include: >> >> 1) Slow down guest considerably via cgroup's CPU controls - requires >> libvirt client support to detect & trigger action, but conceptually >> similar to this RFC change. >> >> 2) Speed up transfer rate: >> - RDMA based Pre-copy - lower overhead and fast (Unfortunately >> has a few restrictions and some customers still choose not >> to deploy RDMA :-( ). >> - Add parallelism to improve transfer rate and use multiple 10Gig >> connections (bonded). - could add some overhead on the host. >> >> 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds >> promising but need to consider & handle newer failure scenarios. >> >> If an enterprise user chooses to force convergence of their migration >> via the new capability "auto-converge" then with this change we auto-detect >> lack of convergence scenario and trigger a slow down of the workload >> by explicitly disallowing the VCPUs from spending much time in the VM >> context. >> >> The migration thread tries to catchup and this eventually leads >> to convergence in some "deterministic" amount of time. Yes it does >> impact the performance of all the VCPUs but in my observation that >> lasts only for a short duration of time. i.e. we end up entering >> stage 3 (downtime phase) soon after that. >> >> No exernal trigger is required (unlike option 1) and it can co-exist >> with enhancements being pursued as part of Option 2 (e.g. RDMA). >> >> Thanks to Juan and Paolo for their useful suggestions. >> >> Verified the convergence using the following : >> - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy) >> - OLTP like workload running on a 80VCPU/512G guest (~80% busy) >> >> Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and >> migrate downtime set to 4seconds). >> >> (qemu) info migrate >> capabilities: xbzrle: off auto-converge: off <---- >> Migration status: active >> total time: 1487503 milliseconds >> expected downtime: 519 milliseconds >> transferred ram: 383749347 kbytes >> remaining ram: 2753372 kbytes >> total ram: 268444224 kbytes >> duplicate: 65461532 pages >> skipped: 64901568 pages >> normal: 95750218 pages >> normal bytes: 383000872 kbytes >> dirty pages rate: 67551 pages >> >> --- >> >> (qemu) info migrate >> capabilities: xbzrle: off auto-converge: on <---- >> Migration status: completed >> total time: 241161 milliseconds >> downtime: 6373 milliseconds >> transferred ram: 28235307 kbytes >> remaining ram: 0 kbytes >> total ram: 268444224 kbytes >> duplicate: 64946416 pages >> skipped: 64903523 pages >> normal: 7044971 pages >> normal bytes: 28179884 kbytes >> >> Changes from v1: >> - rebased to latest qemu.git >> - added auto-converge capability(default off) - suggested by Anthony Liguori & >> Eric Blake. >> >> Signed-off-by: Chegu Vinod >> --- >> arch_init.c | 44 +++++++++++++++++++++++++++++++++++ >> cpus.c | 12 +++++++++ >> include/migration/migration.h | 12 +++++++++ >> include/qemu/main-loop.h | 3 ++ >> kvm-all.c | 51 +++++++++++++++++++++++++++++++++++++++++ >> migration.c | 15 ++++++++++++ >> qapi-schema.json | 6 ++++- >> 7 files changed, 142 insertions(+), 1 deletions(-) >> >> diff --git a/arch_init.c b/arch_init.c >> index 92de1bd..6dcc742 100644 >> --- a/arch_init.c >> +++ b/arch_init.c >> @@ -104,6 +104,7 @@ int graphic_depth = 15; >> #endif >> >> const uint32_t arch_type = QEMU_ARCH; >> +static uint64_t mig_throttle_on; >> >> /***********************************************************/ >> /* ram save/restore */ >> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void) >> MigrationState *s = migrate_get_current(); >> static int64_t start_time; >> static int64_t num_dirty_pages_period; >> + static int64_t bytes_xfer_prev; >> int64_t end_time; >> + int64_t bytes_xfer_now; >> + static int dirty_rate_high_cnt; >> + >> + if (migrate_auto_converge() && !bytes_xfer_prev) { >> + bytes_xfer_prev = ram_bytes_transferred(); >> + } >> >> if (!start_time) { >> start_time = qemu_get_clock_ms(rt_clock); >> } >> >> + >> trace_migration_bitmap_sync_start(); >> memory_global_sync_dirty_bitmap(get_system_memory()); >> >> @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void) >> >> /* more than 1 second = 1000 millisecons */ >> if (end_time > start_time + 1000) { >> + if (migrate_auto_converge()) { >> + /* The following detection logic can be refined later. For now: >> + Check to see if the dirtied bytes is 50% more than the approx. >> + amount of bytes that just got transferred since the last time we >> + were in this routine. If that happens N times (for now N==5) >> + we turn on the throttle down logic */ >> + bytes_xfer_now = ram_bytes_transferred(); >> + if (s->dirty_pages_rate && >> + ((num_dirty_pages_period*TARGET_PAGE_SIZE) > >> + ((bytes_xfer_now - bytes_xfer_prev)/2))) { >> + if (dirty_rate_high_cnt++ > 5) { >> + DPRINTF("Unable to converge. Throtting down guest\n"); >> + mig_throttle_on = 1; >> + } > Why not check to see if mig_throttle_on is already on? Once its set to 1 it shouldn't really matter setting it again to 1. Not sure I understand the need for adding the check. > >> + } >> + bytes_xfer_prev = bytes_xfer_now; >> + } >> s->dirty_pages_rate = num_dirty_pages_period * 1000 >> / (end_time - start_time); >> s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE; >> @@ -496,6 +522,24 @@ static int ram_save_block(QEMUFile *f, bool last_stage) >> return bytes_sent; >> } >> >> +bool throttling_needed(void) >> +{ >> + bool value; >> + >> + if (!migrate_auto_converge()) { >> + return false; >> + } >> + >> + qemu_mutex_lock_mig_throttle(); >> + value = mig_throttle_on; >> + qemu_mutex_unlock_mig_throttle(); >> + >> + if (value) { >> + return true; >> + } >> + return false; >> +} >> + > Why not return value here ? yes > Cheers, > Orit > >> static uint64_t bytes_transferred; >> >> static ram_addr_t ram_save_remaining(void) >> diff --git a/cpus.c b/cpus.c >> index 5a98a37..615c25a 100644 >> --- a/cpus.c >> +++ b/cpus.c >> @@ -616,6 +616,7 @@ static void qemu_tcg_init_cpu_signals(void) >> #endif /* _WIN32 */ >> >> static QemuMutex qemu_global_mutex; >> +static QemuMutex qemu_mig_throttle_mutex; >> static QemuCond qemu_io_proceeded_cond; >> static bool iothread_requesting_mutex; >> >> @@ -638,6 +639,7 @@ void qemu_init_cpu_loop(void) >> qemu_cond_init(&qemu_work_cond); >> qemu_cond_init(&qemu_io_proceeded_cond); >> qemu_mutex_init(&qemu_global_mutex); >> + qemu_mutex_init(&qemu_mig_throttle_mutex); >> >> qemu_thread_get_self(&io_thread); >> } >> @@ -943,6 +945,16 @@ void qemu_mutex_unlock_iothread(void) >> qemu_mutex_unlock(&qemu_global_mutex); >> } >> >> +void qemu_mutex_lock_mig_throttle(void) >> +{ >> + qemu_mutex_lock(&qemu_mig_throttle_mutex); >> +} >> + >> +void qemu_mutex_unlock_mig_throttle(void) >> +{ >> + qemu_mutex_unlock(&qemu_mig_throttle_mutex); >> +} >> + >> static int all_vcpus_paused(void) >> { >> CPUArchState *penv = first_cpu; >> diff --git a/include/migration/migration.h b/include/migration/migration.h >> index e2acec6..94bdb8c 100644 >> --- a/include/migration/migration.h >> +++ b/include/migration/migration.h >> @@ -127,4 +127,16 @@ int migrate_use_xbzrle(void); >> int64_t migrate_xbzrle_cache_size(void); >> >> int64_t xbzrle_cache_resize(int64_t new_size); >> + >> +#ifndef _QEMU_MIG_THROTTLE >> +#define _QEMU_MIG_THROTTLE > Do you need those defines? Not needed ...will remove them. > >> + >> +bool migrate_auto_converge(void); >> + >> +bool throttling_needed(void); >> +bool throttling_now(void); >> +void *migration_throttle_down(void *); >> + >> +#endif >> + >> #endif >> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h >> index 6f0200a..9a3886d 100644 >> --- a/include/qemu/main-loop.h >> +++ b/include/qemu/main-loop.h >> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void); >> */ >> void qemu_mutex_unlock_iothread(void); >> >> +void qemu_mutex_lock_mig_throttle(void); >> +void qemu_mutex_unlock_mig_throttle(void); >> + >> /* internal interfaces */ >> >> void qemu_fd_register(int fd); >> diff --git a/kvm-all.c b/kvm-all.c >> index 2d92721..a92cb77 100644 >> --- a/kvm-all.c >> +++ b/kvm-all.c >> @@ -33,6 +33,8 @@ >> #include "exec/memory.h" >> #include "exec/address-spaces.h" >> #include "qemu/event_notifier.h" >> +#include "sysemu/cpus.h" >> +#include "migration/migration.h" >> >> /* This check must be after config-host.h is included */ >> #ifdef CONFIG_EVENTFD >> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = { >> KVM_CAP_LAST_INFO >> }; >> >> +static void mig_delay_vcpu(void); >> + >> static KVMSlot *kvm_alloc_slot(KVMState *s) >> { >> int i; >> @@ -1609,6 +1613,10 @@ int kvm_cpu_exec(CPUArchState *env) >> } >> qemu_mutex_unlock_iothread(); >> >> + if (throttling_needed()) { >> + mig_delay_vcpu(); >> + } >> + >> run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0); >> >> qemu_mutex_lock_iothread(); >> @@ -2032,3 +2040,46 @@ int kvm_on_sigbus(int code, void *addr) >> { >> return kvm_arch_on_sigbus(code, addr); >> } >> + >> +static bool throttling; >> +bool throttling_now(void) >> +{ >> + if (throttling) { >> + return true; >> + } >> + return false; >> +} >> + > it will be simpler to just return throttling ? yes. will fix it. Thanks for your comments Vinod > >> +static void mig_delay_vcpu(void) >> +{ >> + g_usleep(50*1000); >> +} >> + >> +/* Stub used for getting the vcpu out of VM and into qemu via >> + run_on_cpu()*/ >> +static void mig_kick_cpu(void *opq) >> +{ >> + return; >> +} >> + >> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending >> + much time in the VM. The migration thread will try to catchup. >> + Workload will experience a greater performance drop but for a shorter >> + duration. >> +*/ >> +void *migration_throttle_down(void *opaque) >> +{ >> + throttling = true; >> + while (throttling_needed()) { >> + CPUArchState *penv = first_cpu; >> + while (penv) { >> + qemu_mutex_lock_iothread(); >> + run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL); >> + qemu_mutex_unlock_iothread(); >> + penv = penv->next_cpu; >> + } >> + g_usleep(25*1000); >> + } >> + throttling = false; >> + return NULL; >> +} >> diff --git a/migration.c b/migration.c >> index 3eb0fad..834156e 100644 >> --- a/migration.c >> +++ b/migration.c >> @@ -24,6 +24,7 @@ >> #include "qemu/thread.h" >> #include "qmp-commands.h" >> #include "trace.h" >> +#include "sysemu/cpus.h" >> >> //#define DEBUG_MIGRATION >> >> @@ -474,6 +475,15 @@ void qmp_migrate_set_downtime(double value, Error **errp) >> max_downtime = (uint64_t)value; >> } >> >> +bool migrate_auto_converge(void) >> +{ >> + MigrationState *s; >> + >> + s = migrate_get_current(); >> + >> + return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE]; >> +} >> + >> int migrate_use_xbzrle(void) >> { >> MigrationState *s; >> @@ -503,6 +513,7 @@ static void *migration_thread(void *opaque) >> int64_t max_size = 0; >> int64_t start_time = initial_time; >> bool old_vm_running = false; >> + QemuThread thread; >> >> DPRINTF("beginning savevm\n"); >> qemu_savevm_state_begin(s->file, &s->params); >> @@ -517,6 +528,10 @@ static void *migration_thread(void *opaque) >> DPRINTF("pending size %lu max %lu\n", pending_size, max_size); >> if (pending_size && pending_size >= max_size) { >> qemu_savevm_state_iterate(s->file); >> + if (throttling_needed() && !throttling_now()) { >> + qemu_thread_create(&thread, migration_throttle_down, >> + NULL, QEMU_THREAD_DETACHED); >> + } >> } else { >> DPRINTF("done iterating\n"); >> qemu_mutex_lock_iothread(); >> diff --git a/qapi-schema.json b/qapi-schema.json >> index 5b0fb3b..b662e33 100644 >> --- a/qapi-schema.json >> +++ b/qapi-schema.json >> @@ -599,10 +599,14 @@ >> # This feature allows us to minimize migration traffic for certain work >> # loads, by sending compressed difference of the pages >> # >> +# @auto-converge: Controls whether or not the we want the migration to >> +# automaticially detect and force convergence by slowing >> +# down the guest. Disabled by default. >> +# >> # Since: 1.2 >> ## >> { 'enum': 'MigrationCapability', >> - 'data': ['xbzrle'] } >> + 'data': ['xbzrle', 'auto-converge'] } >> >> ## >> # @MigrationCapabilityStatus >> . >>