Message-ID: <5188E243.6060308@redhat.com>
Date: Tue, 07 May 2013 13:15:15 +0200
From: Paolo Bonzini
In-Reply-To: <1367902826-28190-1-git-send-email-chegu_vinod@hp.com>
Subject: Re: [Qemu-devel] [RFC PATCH v4] Throttle-down guest when live migration does not converge.
To: Chegu Vinod
Cc: owasserm@redhat.com, qemu-devel@nongnu.org, anthony@codemonkey.ws, quintela@redhat.com

On 07/05/2013 07:00, Chegu Vinod wrote:
> Busy enterprise workloads hosted on large-sized VMs tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (and using dedicated 10Gig NICs
> between hosts), the live migration does NOT converge.
>
> If a user chooses to force convergence of their migration via a new
> migration capability "auto-converge", then this change will auto-detect
> the lack-of-convergence scenario and trigger a slowdown of the workload
> by explicitly disallowing the VCPUs from spending much time in the VM
> context.
>
> The migration thread tries to catch up and this eventually leads
> to convergence in some "deterministic" amount of time.
> Yes, it does impact the performance of all the VCPUs, but in my
> observation that lasts only for a short duration of time; i.e. we end
> up entering stage 3 (the downtime phase) soon after that. No external
> trigger is required.
>
> Thanks to Juan and Paolo for their useful suggestions.
>
> Verified the convergence using the following:
> - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy)
> - OLTP-like workload running on an 80VCPU/512G guest (~80% busy)
>
> Sample results with the SpecJbb2005 workload (migrate speed set to
> 20Gb and migrate downtime set to 4 seconds):
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: off  <----
> Migration status: active
> total time: 1487503 milliseconds
> expected downtime: 519 milliseconds
> transferred ram: 383749347 kbytes
> remaining ram: 2753372 kbytes
> total ram: 268444224 kbytes
> duplicate: 65461532 pages
> skipped: 64901568 pages
> normal: 95750218 pages
> normal bytes: 383000872 kbytes
> dirty pages rate: 67551 pages
>
> ---
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: on  <----
> Migration status: completed
> total time: 241161 milliseconds
> downtime: 6373 milliseconds
> transferred ram: 28235307 kbytes
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64946416 pages
> skipped: 64903523 pages
> normal: 7044971 pages
> normal bytes: 28179884 kbytes
>
> ---
>
> Changes from v3:
> - incorporated feedback from Paolo and Eric
> - rebased to latest qemu.git
>
> Changes from v2:
> - incorporated feedback from Orit, Juan and Eric
> - stop the throttling thread at the start of stage 3
> - rebased to latest qemu.git
>
> Changes from v1:
> - rebased to latest qemu.git
> - added auto-converge capability (default off) - suggested by
>   Anthony Liguori & Eric Blake.
>
> Signed-off-by: Chegu Vinod
> ---
>  arch_init.c                   | 61 ++++++++++++++++++++++++++++++++++++++++-
>  cpus.c                        | 41 +++++++++++++++++++++++++++
>  include/migration/migration.h |  7 +++++
>  include/qemu-common.h         |  1 +
>  include/qemu/main-loop.h      |  3 ++
>  include/qom/cpu.h             | 10 +++++++
>  kvm-all.c                     | 46 +++++++++++++++++++++++++++++++
>  migration.c                   | 18 ++++++++++++
>  qapi-schema.json              |  5 +++-
>  9 files changed, 190 insertions(+), 2 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 49c5dc2..2f703cf 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -104,6 +104,7 @@ int graphic_depth = 15;
>  #endif
>
>  const uint32_t arch_type = QEMU_ARCH;
> +static bool mig_throttle_on;
>
>  /***********************************************************/
>  /* ram save/restore */
> @@ -379,7 +380,14 @@ static void migration_bitmap_sync(void)
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
>      static int64_t num_dirty_pages_period;
> +    static int64_t bytes_xfer_prev;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
> +
> +    if (!bytes_xfer_prev) {
> +        bytes_xfer_prev = ram_bytes_transferred();
> +    }
>
>      if (!start_time) {
>          start_time = qemu_get_clock_ms(rt_clock);
> @@ -404,6 +412,27 @@ static void migration_bitmap_sync(void)
>
>      /* more than 1 second = 1000 milliseconds */
>      if (end_time > start_time + 1000) {
> +        if (migrate_auto_converge()) {
> +            /* The following detection logic can be refined later. For now:
> +               Check to see if the dirtied bytes is 50% more than the approx.
> +               amount of bytes that just got transferred since the last time
> +               we were in this routine. If that happens N times (for now N==5)
> +               we turn on the throttle down logic */
> +            bytes_xfer_now = ram_bytes_transferred();
> +            if (s->dirty_pages_rate &&
> +                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
> +                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
> +                if (dirty_rate_high_cnt++ > 5) {
> +                    DPRINTF("Unable to converge. Throttling down guest\n");
> +                    qemu_mutex_lock_mig_throttle();

You do not need this lock; you can assume that int accesses are atomic.

> +                    if (!mig_throttle_on) {

You do not need the "if" either (just set it to true).

> +                        mig_throttle_on = true;
> +                    }
> +                    qemu_mutex_unlock_mig_throttle();
> +                }
> +            }
> +            bytes_xfer_prev = bytes_xfer_now;
> +        }
>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>              / (end_time - start_time);
>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
> @@ -496,6 +525,33 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>      return bytes_sent;
>  }
>
> +bool throttling_needed(void)
> +{
> +    bool value = false;
> +
> +    if (!migrate_auto_converge()) {
> +        return false;
> +    }
> +
> +    qemu_mutex_lock_mig_throttle();
> +    value = mig_throttle_on;
> +    qemu_mutex_unlock_mig_throttle();
> +
> +    return value;
> +}
> +
> +void stop_throttling(void)
> +{
> +    qemu_mutex_lock_mig_throttle();
> +    mig_throttle_on = false;
> +    qemu_mutex_unlock_mig_throttle();
> +
> +    /* wait for the throttling thread to get out */
> +    while (throttling_now()) {
> +        ;
> +    }
> +}
> +
>  static uint64_t bytes_transferred;
>
>  static ram_addr_t ram_save_remaining(void)
> @@ -544,6 +600,9 @@ static void migration_end(void)
>
>  static void ram_migration_cancel(void *opaque)
>  {
> +    qemu_mutex_lock_mig_throttle();
> +    mig_throttle_on = false;
> +    qemu_mutex_unlock_mig_throttle();

You set it to false in setup already, so this should not be necessary.
>      migration_end();
>  }
>
> @@ -584,7 +643,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      qemu_mutex_lock_ramlist();
>      bytes_transferred = 0;
>      reset_ram_globals();
> -
> +    mig_throttle_on = false;
>      memory_global_dirty_log_start();
>      migration_bitmap_sync();
>      qemu_mutex_unlock_iothread();
> diff --git a/cpus.c b/cpus.c
> index c232265..100f1cf 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -616,6 +616,7 @@ static void qemu_tcg_init_cpu_signals(void)
>  #endif /* _WIN32 */
>
>  static QemuMutex qemu_global_mutex;
> +static QemuMutex qemu_mig_throttle_mutex;
>  static QemuCond qemu_io_proceeded_cond;
>  static bool iothread_requesting_mutex;
>
> @@ -638,10 +639,36 @@ void qemu_init_cpu_loop(void)
>      qemu_cond_init(&qemu_work_cond);
>      qemu_cond_init(&qemu_io_proceeded_cond);
>      qemu_mutex_init(&qemu_global_mutex);
> +    qemu_mutex_init(&qemu_mig_throttle_mutex);
>
>      qemu_thread_get_self(&io_thread);
>  }
>
> +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
> +{
> +    struct qemu_work_item *wi;
> +
> +    if (qemu_cpu_is_self(cpu)) {
> +        func(data);
> +        return;
> +    }
> +
> +    wi = g_malloc0(sizeof(struct qemu_work_item));
> +    wi->func = func;
> +    wi->data = data;
> +    wi->free = true;
> +    if (cpu->queued_work_first == NULL) {
> +        cpu->queued_work_first = wi;
> +    } else {
> +        cpu->queued_work_last->next = wi;
> +    }
> +    cpu->queued_work_last = wi;
> +    wi->next = NULL;
> +    wi->done = false;
> +
> +    qemu_cpu_kick(cpu);
> +}
> +
>  void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>  {
>      struct qemu_work_item wi;
> @@ -653,6 +680,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>
>      wi.func = func;
>      wi.data = data;
> +    wi.free = false;
>      if (cpu->queued_work_first == NULL) {
>          cpu->queued_work_first = &wi;
>      } else {
> @@ -683,6 +711,9 @@ static void flush_queued_work(CPUState *cpu)
>          cpu->queued_work_first = wi->next;
>          wi->func(wi->data);
>          wi->done = true;
> +        if (wi->free) {
> +            g_free(wi);
> +        }
>      }
>
>      cpu->queued_work_last = NULL;
>      qemu_cond_broadcast(&qemu_work_cond);

This new functionality is good; please put it in a separate patch.

> @@ -944,6 +975,16 @@ void qemu_mutex_unlock_iothread(void)
>      qemu_mutex_unlock(&qemu_global_mutex);
>  }
>
> +void qemu_mutex_lock_mig_throttle(void)
> +{
> +    qemu_mutex_lock(&qemu_mig_throttle_mutex);
> +}
> +
> +void qemu_mutex_unlock_mig_throttle(void)
> +{
> +    qemu_mutex_unlock(&qemu_mig_throttle_mutex);
> +}

In general, it is better to avoid mutexes that are used across module
boundaries. It becomes a maze very quickly (and indeed in this case you
don't need one).

>  static int all_vcpus_paused(void)
>  {
>      CPUArchState *penv = first_cpu;
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e2acec6..4b54dbf 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -127,4 +127,11 @@ int migrate_use_xbzrle(void);
>  int64_t migrate_xbzrle_cache_size(void);
>
>  int64_t xbzrle_cache_resize(int64_t new_size);
> +
> +bool migrate_auto_converge(void);
> +bool throttling_needed(void);
> +bool throttling_now(void);
> +void stop_throttling(void);
> +void *migration_throttle_down(void *);
> +
>  #endif
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index b399d85..bad6e1f 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -286,6 +286,7 @@ struct qemu_work_item {
>      void (*func)(void *data);
>      void *data;
>      int done;
> +    bool free;
>  };
>
>  #ifdef CONFIG_USER_ONLY
> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
> index 6f0200a..9a3886d 100644
> --- a/include/qemu/main-loop.h
> +++ b/include/qemu/main-loop.h
> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>   */
>  void qemu_mutex_unlock_iothread(void);
>
> +void qemu_mutex_lock_mig_throttle(void);
> +void qemu_mutex_unlock_mig_throttle(void);
> +
>  /* internal interfaces */
>
>  void qemu_fd_register(int fd);
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 7cd9442..46465e9 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu);
>  void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
>
>  /**
> + * async_run_on_cpu:
> + * @cpu: The vCPU to run on.
> + * @func: The function to be executed.
> + * @data: Data to pass to the function.
> + *
> + * Schedules the function @func for execution on the vCPU @cpu asynchronously.
> + */
> +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
> +
> +/**
>   * qemu_for_each_cpu:
>   * @func: The function to be executed.
>   * @data: Data to pass to the function.
> diff --git a/kvm-all.c b/kvm-all.c
> index 3a31602..33e6d55 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -34,6 +34,8 @@
>  #include "exec/address-spaces.h"
>  #include "qemu/event_notifier.h"
>  #include "trace.h"
> +#include "sysemu/cpus.h"
> +#include "migration/migration.h"
>
>  /* This check must be after config-host.h is included */
>  #ifdef CONFIG_EVENTFD
> @@ -2038,3 +2040,47 @@ int kvm_on_sigbus(int code, void *addr)
>  {
>      return kvm_arch_on_sigbus(code, addr);
>  }
> +
> +static bool throttling;
> +bool throttling_now(void)
> +{
> +    return throttling;
> +}
> +
> +static void mig_delay_vcpu(void)
> +{
> +    qemu_mutex_unlock_iothread();
> +    g_usleep(50 * 1000);
> +    qemu_mutex_lock_iothread();
> +}
> +
> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu() */
> +static void mig_kick_cpu(void *opq)
> +{
> +    mig_delay_vcpu();
> +    return;
> +}
> +
> +/* To reduce the dirty rate, explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catch up.
> +   The workload will experience a greater performance drop, but for a
> +   shorter duration.
> +*/
> +void *migration_throttle_down(void *opaque)
> +{
> +    throttling = true;
> +    while (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +        g_usleep(25 * 1000);

You're stunning the VCPUs for 50 ms every 25 ms. Maybe I'm missing
something, but why isn't this stopping the VM altogether?

> +    }
> +    throttling = false;
> +    return NULL;
> +}

You do not need a separate thread. You can do this directly in the
migration thread; the overhead of async_run_on_cpu is small. That will
remove the need for the throttling and mig_throttle_on variables (just
conditionalize it on migrate_auto_converge()).

Instead of the g_usleep(25*1000) that you have, you can save the value
of vm_clock at the time of the last "stun", and only do another one if
enough time has passed.

The functionality should be all in arch_init.c (apart from
async_run_on_cpu and the definition of the capability, of course).
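A minimal sketch of that pacing idea (all names here are hypothetical,
not from the patch; qemu_get_clock_ms(vm_clock) is stood in for by a
plain now_ms parameter so the snippet is self-contained):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper for the migration thread: instead of sleeping in
 * a dedicated throttle thread, remember when the vCPUs were last
 * "stunned" and only queue another round of async_run_on_cpu() work
 * once the chosen interval has elapsed. */
static int64_t last_stun_ms;

static bool mig_throttle_due(int64_t now_ms, int64_t interval_ms)
{
    if (now_ms - last_stun_ms < interval_ms) {
        return false;           /* too soon since the last stun */
    }
    last_stun_ms = now_ms;      /* record the time of this stun */
    return true;
}
```

The migration thread would call something like this once per iteration,
guarded by migrate_auto_converge(), and kick the vCPUs only when it
returns true — no extra thread, no sleep, no shared state to lock.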
Paolo

> diff --git a/migration.c b/migration.c
> index 3eb0fad..d170e7b 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -24,6 +24,7 @@
>  #include "qemu/thread.h"
>  #include "qmp-commands.h"
>  #include "trace.h"
> +#include "sysemu/cpus.h"
>
>  //#define DEBUG_MIGRATION
>
> @@ -474,6 +475,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>
> +bool migrate_auto_converge(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
> +}
> +
>  int migrate_use_xbzrle(void)
>  {
>      MigrationState *s;
> @@ -503,6 +513,7 @@ static void *migration_thread(void *opaque)
>      int64_t max_size = 0;
>      int64_t start_time = initial_time;
>      bool old_vm_running = false;
> +    QemuThread thread;
>
>      DPRINTF("beginning savevm\n");
>      qemu_savevm_state_begin(s->file, &s->params);
> @@ -517,8 +528,15 @@
>          DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>          if (pending_size && pending_size >= max_size) {
>              qemu_savevm_state_iterate(s->file);
> +            if (throttling_needed() && !throttling_now()) {
> +                qemu_thread_create(&thread, migration_throttle_down,
> +                                   NULL, QEMU_THREAD_DETACHED);
> +            }
>          } else {
>              DPRINTF("done iterating\n");
> +            if (throttling_now()) {
> +                stop_throttling();
> +            }
>              qemu_mutex_lock_iothread();
>              start_time = qemu_get_clock_ms(rt_clock);
>              qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7797400..b465d91 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -602,10 +602,13 @@
>  # This feature allows us to minimize migration traffic for certain work
>  # loads, by sending compressed difference of the pages
>  #
> +# @auto-converge: Migration supports automatic throttling down of guest
> +#                 to force convergence. (since 1.6)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle'] }
> +  'data': ['xbzrle', 'auto-converge'] }
>
>  ##
>  # @MigrationCapabilityStatus
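For completeness, the new capability would be toggled before starting
the migration, in the same style as the transcripts above — either from
the HMP monitor (using the existing migrate_set_capability command) or
via the QMP migrate-set-capabilities command; the destination address is
a placeholder:

```
(qemu) migrate_set_capability auto-converge on
(qemu) migrate -d tcp:dest-host:4444
(qemu) info migrate

QMP equivalent:
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities":
      [ { "capability": "auto-converge", "state": true } ] } }
```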