Message-ID: <5188E243.6060308@redhat.com>
Date: Tue, 07 May 2013 13:15:15 +0200
From: Paolo Bonzini
In-Reply-To: <1367902826-28190-1-git-send-email-chegu_vinod@hp.com>
Subject: Re: [Qemu-devel] [RFC PATCH v4] Throttle-down guest when live migration does not converge.
To: Chegu Vinod
Cc: owasserm@redhat.com, qemu-devel@nongnu.org, anthony@codemonkey.ws, quintela@redhat.com

On 07/05/2013 07:00, Chegu Vinod wrote:
> Busy enterprise workloads hosted on large-sized VMs tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (and using dedicated 10Gig NICs
> between hosts), the live migration does NOT converge.
>
> If a user chooses to force convergence of their migration via a new
> migration capability "auto-converge", then this change will auto-detect
> the lack-of-convergence scenario and trigger a slowdown of the workload
> by explicitly disallowing the VCPUs from spending much time in the VM
> context.
>
> The migration thread tries to catch up and this eventually leads
> to convergence in some "deterministic" amount of time.
> Yes, it does impact the performance of all the VCPUs, but in my
> observation that lasts only for a short duration of time; i.e. we end
> up entering stage 3 (the downtime phase) soon after that. No external
> trigger is required.
>
> Thanks to Juan and Paolo for their useful suggestions.
>
> Verified the convergence using the following:
> - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy)
> - OLTP-like workload running on an 80VCPU/512G guest (~80% busy)
>
> Sample results with the SpecJbb2005 workload (migrate speed set to
> 20Gb and migrate downtime set to 4 seconds):
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: off  <----
> Migration status: active
> total time: 1487503 milliseconds
> expected downtime: 519 milliseconds
> transferred ram: 383749347 kbytes
> remaining ram: 2753372 kbytes
> total ram: 268444224 kbytes
> duplicate: 65461532 pages
> skipped: 64901568 pages
> normal: 95750218 pages
> normal bytes: 383000872 kbytes
> dirty pages rate: 67551 pages
>
> ---
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: on  <----
> Migration status: completed
> total time: 241161 milliseconds
> downtime: 6373 milliseconds
> transferred ram: 28235307 kbytes
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64946416 pages
> skipped: 64903523 pages
> normal: 7044971 pages
> normal bytes: 28179884 kbytes
>
> ---
>
> Changes from v3:
> - incorporated feedback from Paolo and Eric
> - rebased to latest qemu.git
>
> Changes from v2:
> - incorporated feedback from Orit, Juan and Eric
> - stop the throttling thread at the start of stage 3
> - rebased to latest qemu.git
>
> Changes from v1:
> - rebased to latest qemu.git
> - added auto-converge capability (default off) - suggested by
>   Anthony Liguori & Eric Blake.
>
> Signed-off-by: Chegu Vinod
> ---
>  arch_init.c                   | 61 ++++++++++++++++++++++++++++++++++++++++-
>  cpus.c                        | 41 +++++++++++++++++++++++++++
>  include/migration/migration.h |  7 +++++
>  include/qemu-common.h         |  1 +
>  include/qemu/main-loop.h      |  3 ++
>  include/qom/cpu.h             | 10 +++++++
>  kvm-all.c                     | 46 +++++++++++++++++++++++++++++++
>  migration.c                   | 18 ++++++++++++
>  qapi-schema.json              |  5 +++-
>  9 files changed, 190 insertions(+), 2 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 49c5dc2..2f703cf 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -104,6 +104,7 @@ int graphic_depth = 15;
>  #endif
>
>  const uint32_t arch_type = QEMU_ARCH;
> +static bool mig_throttle_on;
>
>  /***********************************************************/
>  /* ram save/restore */
> @@ -379,7 +380,14 @@ static void migration_bitmap_sync(void)
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
>      static int64_t num_dirty_pages_period;
> +    static int64_t bytes_xfer_prev;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
> +
> +    if (!bytes_xfer_prev) {
> +        bytes_xfer_prev = ram_bytes_transferred();
> +    }
>
>      if (!start_time) {
>          start_time = qemu_get_clock_ms(rt_clock);
> @@ -404,6 +412,27 @@ static void migration_bitmap_sync(void)
>
>      /* more than 1 second = 1000 milliseconds */
>      if (end_time > start_time + 1000) {
> +        if (migrate_auto_converge()) {
> +            /* The following detection logic can be refined later. For now:
> +               Check to see if the dirtied bytes is 50% more than the approx.
> +               amount of bytes that just got transferred since the last time
> +               we were in this routine. If that happens N times (for now N==5)
> +               we turn on the throttle down logic */
> +            bytes_xfer_now = ram_bytes_transferred();
> +            if (s->dirty_pages_rate &&
> +                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
> +                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
> +                if (dirty_rate_high_cnt++ > 5) {
> +                    DPRINTF("Unable to converge. Throttling down guest\n");
> +                    qemu_mutex_lock_mig_throttle();

You do not need this lock; you can assume that int accesses are atomic.

> +                    if (!mig_throttle_on) {

You do not need the "if" either (just set it to true).

> +                        mig_throttle_on = true;
> +                    }
> +                    qemu_mutex_unlock_mig_throttle();
> +                }
> +            }
> +            bytes_xfer_prev = bytes_xfer_now;
> +        }
>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>              / (end_time - start_time);
>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
> @@ -496,6 +525,33 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>      return bytes_sent;
>  }
>
> +bool throttling_needed(void)
> +{
> +    bool value = false;
> +
> +    if (!migrate_auto_converge()) {
> +        return false;
> +    }
> +
> +    qemu_mutex_lock_mig_throttle();
> +    value = mig_throttle_on;
> +    qemu_mutex_unlock_mig_throttle();
> +
> +    return value;
> +}
> +
> +void stop_throttling(void)
> +{
> +    qemu_mutex_lock_mig_throttle();
> +    mig_throttle_on = false;
> +    qemu_mutex_unlock_mig_throttle();
> +
> +    /* wait for the throttling thread to get out */
> +    while (throttling_now()) {
> +        ;
> +    }
> +}
> +
>  static uint64_t bytes_transferred;
>
>  static ram_addr_t ram_save_remaining(void)
> @@ -544,6 +600,9 @@ static void migration_end(void)
>
>  static void ram_migration_cancel(void *opaque)
>  {
> +    qemu_mutex_lock_mig_throttle();
> +    mig_throttle_on = false;
> +    qemu_mutex_unlock_mig_throttle();

You set it to false in setup already, so this should not be necessary.
>      migration_end();
>  }
>
> @@ -584,7 +643,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      qemu_mutex_lock_ramlist();
>      bytes_transferred = 0;
>      reset_ram_globals();
> -
> +    mig_throttle_on = false;
>      memory_global_dirty_log_start();
>      migration_bitmap_sync();
>      qemu_mutex_unlock_iothread();
> diff --git a/cpus.c b/cpus.c
> index c232265..100f1cf 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -616,6 +616,7 @@ static void qemu_tcg_init_cpu_signals(void)
>  #endif /* _WIN32 */
>
>  static QemuMutex qemu_global_mutex;
> +static QemuMutex qemu_mig_throttle_mutex;
>  static QemuCond qemu_io_proceeded_cond;
>  static bool iothread_requesting_mutex;
>
> @@ -638,10 +639,36 @@ void qemu_init_cpu_loop(void)
>      qemu_cond_init(&qemu_work_cond);
>      qemu_cond_init(&qemu_io_proceeded_cond);
>      qemu_mutex_init(&qemu_global_mutex);
> +    qemu_mutex_init(&qemu_mig_throttle_mutex);
>
>      qemu_thread_get_self(&io_thread);
>  }
>
> +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
> +{
> +    struct qemu_work_item *wi;
> +
> +    if (qemu_cpu_is_self(cpu)) {
> +        func(data);
> +        return;
> +    }
> +
> +    wi = g_malloc0(sizeof(struct qemu_work_item));
> +    wi->func = func;
> +    wi->data = data;
> +    wi->free = true;
> +    if (cpu->queued_work_first == NULL) {
> +        cpu->queued_work_first = wi;
> +    } else {
> +        cpu->queued_work_last->next = wi;
> +    }
> +    cpu->queued_work_last = wi;
> +    wi->next = NULL;
> +    wi->done = false;
> +
> +    qemu_cpu_kick(cpu);
> +}
> +
>  void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>  {
>      struct qemu_work_item wi;
> @@ -653,6 +680,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>
>      wi.func = func;
>      wi.data = data;
> +    wi.free = false;
>      if (cpu->queued_work_first == NULL) {
>          cpu->queued_work_first = &wi;
>      } else {
> @@ -683,6 +711,9 @@ static void flush_queued_work(CPUState *cpu)
>          cpu->queued_work_first = wi->next;
>          wi->func(wi->data);
>          wi->done = true;
> +        if (wi->free) {
> +            g_free(wi);
> +        }
>      }
>
>      cpu->queued_work_last = NULL;
>      qemu_cond_broadcast(&qemu_work_cond);

This new functionality is good; please put it in a separate patch.

> @@ -944,6 +975,16 @@ void qemu_mutex_unlock_iothread(void)
>      qemu_mutex_unlock(&qemu_global_mutex);
>  }
>
> +void qemu_mutex_lock_mig_throttle(void)
> +{
> +    qemu_mutex_lock(&qemu_mig_throttle_mutex);
> +}
> +
> +void qemu_mutex_unlock_mig_throttle(void)
> +{
> +    qemu_mutex_unlock(&qemu_mig_throttle_mutex);
> +}

In general, it is better to avoid mutexes that are used across module
boundaries. It becomes a maze very quickly (and indeed in this case you
don't need one).

>  static int all_vcpus_paused(void)
>  {
>      CPUArchState *penv = first_cpu;
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e2acec6..4b54dbf 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -127,4 +127,11 @@ int migrate_use_xbzrle(void);
>  int64_t migrate_xbzrle_cache_size(void);
>
>  int64_t xbzrle_cache_resize(int64_t new_size);
> +
> +bool migrate_auto_converge(void);
> +bool throttling_needed(void);
> +bool throttling_now(void);
> +void stop_throttling(void);
> +void *migration_throttle_down(void *);
> +
>  #endif
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index b399d85..bad6e1f 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -286,6 +286,7 @@ struct qemu_work_item {
>      void (*func)(void *data);
>      void *data;
>      int done;
> +    bool free;
>  };
>
>  #ifdef CONFIG_USER_ONLY
> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
> index 6f0200a..9a3886d 100644
> --- a/include/qemu/main-loop.h
> +++ b/include/qemu/main-loop.h
> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>   */
>  void qemu_mutex_unlock_iothread(void);
>
> +void qemu_mutex_lock_mig_throttle(void);
> +void qemu_mutex_unlock_mig_throttle(void);
> +
>  /* internal interfaces */
>
>  void qemu_fd_register(int fd);
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 7cd9442..46465e9 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu);
>  void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
>
>  /**
> + * async_run_on_cpu:
> + * @cpu: The vCPU to run on.
> + * @func: The function to be executed.
> + * @data: Data to pass to the function.
> + *
> + * Schedules the function @func for execution on the vCPU @cpu asynchronously.
> + */
> +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
> +
> +/**
>   * qemu_for_each_cpu:
>   * @func: The function to be executed.
>   * @data: Data to pass to the function.
> diff --git a/kvm-all.c b/kvm-all.c
> index 3a31602..33e6d55 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -34,6 +34,8 @@
>  #include "exec/address-spaces.h"
>  #include "qemu/event_notifier.h"
>  #include "trace.h"
> +#include "sysemu/cpus.h"
> +#include "migration/migration.h"
>
>  /* This check must be after config-host.h is included */
>  #ifdef CONFIG_EVENTFD
> @@ -2038,3 +2040,47 @@ int kvm_on_sigbus(int code, void *addr)
>  {
>      return kvm_arch_on_sigbus(code, addr);
>  }
> +
> +static bool throttling;
> +bool throttling_now(void)
> +{
> +    return throttling;
> +}
> +
> +static void mig_delay_vcpu(void)
> +{
> +    qemu_mutex_unlock_iothread();
> +    g_usleep(50 * 1000);
> +    qemu_mutex_lock_iothread();
> +}
> +
> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu() */
> +static void mig_kick_cpu(void *opq)
> +{
> +    mig_delay_vcpu();
> +    return;
> +}
> +
> +/* To reduce the dirty rate, explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catch up.
> +   The workload will experience a greater performance drop, but for a
> +   shorter duration.
> +*/
> +void *migration_throttle_down(void *opaque)
> +{
> +    throttling = true;
> +    while (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +        g_usleep(25 * 1000);

You're stunning the VCPUs for 50 ms every 25 ms. Maybe I'm missing
something, but why isn't this stopping the VM altogether?

> +    }
> +    throttling = false;
> +    return NULL;
> +}

You do not need a separate thread. You can do this directly in the
migration thread; the overhead of async_run_on_cpu is small. That will
remove the need for the throttling and mig_throttle_on variables (just
conditionalize it on migrate_auto_converge()).

Instead of the g_usleep(25*1000) that you have, you can save the value
of vm_clock at the time of the last "stun", and only do another one if
enough time has passed.

The functionality should be all in arch_init.c (apart from
async_run_on_cpu and the definition of the capability, of course).
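A minimal sketch of that pacing idea (all names here are hypothetical,
not from the patch; qemu_get_clock_ms(vm_clock) is stood in for by a
plain now_ms parameter so the snippet is self-contained):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper for the migration thread: instead of sleeping in
 * a dedicated throttle thread, remember when the vCPUs were last
 * "stunned" and only queue another round of async_run_on_cpu() work
 * once the chosen interval has elapsed. */
static int64_t last_stun_ms;

static bool mig_throttle_due(int64_t now_ms, int64_t interval_ms)
{
    if (now_ms - last_stun_ms < interval_ms) {
        return false;           /* too soon since the last stun */
    }
    last_stun_ms = now_ms;      /* record the time of this stun */
    return true;
}
```

The migration thread would call something like this once per iteration,
guarded by migrate_auto_converge(), and kick the vCPUs only when it
returns true — no extra thread, no sleep, no shared state to lock.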
Paolo

> diff --git a/migration.c b/migration.c
> index 3eb0fad..d170e7b 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -24,6 +24,7 @@
>  #include "qemu/thread.h"
>  #include "qmp-commands.h"
>  #include "trace.h"
> +#include "sysemu/cpus.h"
>
>  //#define DEBUG_MIGRATION
>
> @@ -474,6 +475,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>
> +bool migrate_auto_converge(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
> +}
> +
>  int migrate_use_xbzrle(void)
>  {
>      MigrationState *s;
> @@ -503,6 +513,7 @@ static void *migration_thread(void *opaque)
>      int64_t max_size = 0;
>      int64_t start_time = initial_time;
>      bool old_vm_running = false;
> +    QemuThread thread;
>
>      DPRINTF("beginning savevm\n");
>      qemu_savevm_state_begin(s->file, &s->params);
> @@ -517,8 +528,15 @@
>          DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>          if (pending_size && pending_size >= max_size) {
>              qemu_savevm_state_iterate(s->file);
> +            if (throttling_needed() && !throttling_now()) {
> +                qemu_thread_create(&thread, migration_throttle_down,
> +                                   NULL, QEMU_THREAD_DETACHED);
> +            }
>          } else {
>              DPRINTF("done iterating\n");
> +            if (throttling_now()) {
> +                stop_throttling();
> +            }
>              qemu_mutex_lock_iothread();
>              start_time = qemu_get_clock_ms(rt_clock);
>              qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER);
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7797400..b465d91 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -602,10 +602,13 @@
>  # This feature allows us to minimize migration traffic for certain work
>  # loads, by sending compressed difference of the pages
>  #
> +# @auto-converge: Migration supports automatic throttling down of guest
> +#                 to force convergence. (since 1.6)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle'] }
> +  'data': ['xbzrle', 'auto-converge'] }
>
>  ##
>  # @MigrationCapabilityStatus
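For completeness, the new capability would be toggled before starting
the migration, in the same style as the transcripts above — either from
the HMP monitor (using the existing migrate_set_capability command) or
via the QMP migrate-set-capabilities command; the destination address is
a placeholder:

```
(qemu) migrate_set_capability auto-converge on
(qemu) migrate -d tcp:dest-host:4444
(qemu) info migrate

QMP equivalent:
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities":
      [ { "capability": "auto-converge", "state": true } ] } }
```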