From: Anthony Liguori
Date: Wed, 24 Apr 2013 18:59:55 -0700
Subject: Re: [Qemu-devel] [RFC PATCH] Throttle-down guest when live migration does not converge.
In-Reply-To: <1366854124-16348-1-git-send-email-chegu_vinod@hp.com>
To: Chegu Vinod <chegu_vinod@hp.com>
Cc: qemu-devel@nongnu.org

On Wed, Apr 24, 2013 at 6:42 PM, Chegu Vinod <chegu_vinod@hp.com> wrote:
> Busy enterprise workloads hosted on large-sized VMs tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (and using dedicated 10GigE NICs
> between hosts), the live migration does NOT converge.
>
> A few options that were discussed or are being pursued to help with
> the convergence issue include:
>
> 1) Slow down the guest considerably via cgroup's CPU controls - requires
>    libvirt client support to detect and trigger the action, but is
>    conceptually similar to this RFC change.
>
> 2) Speed up the transfer rate:
>    - RDMA-based pre-copy - lower overhead and fast (unfortunately it
>      has a few restrictions, and some customers still choose not
>      to deploy RDMA :-( ).
>    - Add parallelism to improve the transfer rate and use multiple
>      bonded 10GigE connections - could add some overhead on the host.
>
> 3) Post-copy (preferably with RDMA) or a pre+post-copy hybrid - sounds
>    promising, but newer failure scenarios need to be considered and
>    handled.
>
> The following [RFC] change attempts to auto-detect a
> lack-of-convergence situation and trigger a slowdown of the workload
> by explicitly disallowing the VCPUs from spending much time in the VM
> context. No external trigger is required (unlike option 1) and it can
> co-exist with the enhancements being pursued as part of option 2
> (e.g. RDMA).
>
> The migration thread tries to catch up and this eventually leads
> to convergence in some "deterministic" amount of time. Yes, it does
> impact the performance of all the VCPUs, but in my observation that
> lasts only for a short duration, i.e. we end up entering
> stage 3 (the downtime phase) soon after that.

This is a reasonable idea and approach, but it cannot be unconditional.
Sacrificing VCPU performance to encourage convergence is a management
decision. In some cases, VCPU performance is far more important than
migration convergence.
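
As a sketch of what making it conditional could look like (illustrative
only; the capability name and helper below are invented, not existing
QEMU API), the throttle could be gated behind an opt-in migration
capability that management tools set explicitly:

    /* Hypothetical sketch for migration.c: expose the policy as a
     * migration capability so that enabling the throttle is an explicit
     * management decision.  MIGRATION_CAPABILITY_AUTO_CONVERGE is an
     * assumed name; MigrationState and migrate_get_current() are real. */
    static bool migrate_auto_converge(void)
    {
        MigrationState *s = migrate_get_current();

        return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
    }

migration_bitmap_sync() would then check such a helper before ever
setting mig_throttle_on, leaving today's behavior as the default.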
Regards,

Anthony Liguori

> Verified the convergence using the following:
> - SPECjbb2005 workload running on a 20-VCPU/128G guest (~80% busy)
> - OLTP-like workload running on an 80-VCPU/512G guest (~80% busy)
>
> Thanks to Juan and Paolo for some useful suggestions. More
> refinement is needed (e.g. a smarter way to detect, variable
> throttling based on need, etc.). For now I was hoping to get
> some feedback or hear about other more refined ideas.
>
> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  arch_init.c                   |   37 +++++++++++++++++++++++++++++++
>  cpus.c                        |   12 ++++++++++
>  include/migration/migration.h |    9 +++++++
>  include/qemu/main-loop.h      |    3 ++
>  kvm-all.c                     |   49 +++++++++++++++++++++++++++++++++++++++++
>  migration.c                   |    6 +++++
>  6 files changed, 116 insertions(+), 0 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 92de1bd..a06ff81 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -104,6 +104,7 @@ int graphic_depth = 15;
>  #endif
>
>  const uint32_t arch_type = QEMU_ARCH;
> +static uint64_t mig_throttle_on;
>
>  /***********************************************************/
>  /* ram save/restore */
> @@ -379,12 +380,19 @@ static void migration_bitmap_sync(void)
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
>      static int64_t num_dirty_pages_period;
> +    static int64_t bytes_xfer_prev;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
>
>      if (!start_time) {
>          start_time = qemu_get_clock_ms(rt_clock);
>      }
>
> +    if (!bytes_xfer_prev) {
> +        bytes_xfer_prev = ram_bytes_transferred();
> +    }
> +
>      trace_migration_bitmap_sync_start();
>      memory_global_sync_dirty_bitmap(get_system_memory());
>
> @@ -404,6 +412,23 @@ static void migration_bitmap_sync(void)
>
>      /* more than 1 second = 1000 millisecons */
>      if (end_time > start_time + 1000) {
> +        /* The following detection logic can be refined later. For now:
> +           check whether the dirtied bytes are 50% more than the approx.
> +           amount of bytes that just got transferred since the last time
> +           we were in this routine. If that happens N times (for now
> +           N == 5) we turn on the throttle-down logic. */
> +        bytes_xfer_now = ram_bytes_transferred();
> +        if (s->dirty_pages_rate &&
> +            ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
> +             ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
> +            if (dirty_rate_high_cnt++ > 5) {
> +                DPRINTF("Unable to converge. Throttling down guest\n");
> +                mig_throttle_on = 1;
> +            }
> +        }
> +        bytes_xfer_prev = bytes_xfer_now;
> +
>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>              / (end_time - start_time);
>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
> @@ -496,6 +521,18 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>      return bytes_sent;
>  }
>
> +bool throttling_needed(void)
> +{
> +    bool value;
> +
> +    qemu_mutex_lock_mig_throttle();
> +    value = mig_throttle_on;
> +    qemu_mutex_unlock_mig_throttle();
> +
> +    if (value) {
> +        return true;
> +    }
> +    return false;
> +}
> +
>  static uint64_t bytes_transferred;
>
>  static ram_addr_t ram_save_remaining(void)
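
A side note on the detection logic in migration_bitmap_sync() above:
despite the comment's wording, the condition fires when the bytes
dirtied during a sync period exceed *half* of the bytes transferred in
that period. A standalone toy (the traffic numbers are invented, not
taken from the patch) shows the counter starting to climb:

    /* Toy illustration of the patch's trigger condition with made-up
     * numbers; compiles with any C compiler. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const int64_t page_size = 4096;          /* TARGET_PAGE_SIZE on x86  */
        int64_t num_dirty_pages_period = 150000; /* pages dirtied this period */
        int64_t bytes_xfer_prev = 0;             /* bytes sent before period */
        int64_t bytes_xfer_now = 1LL << 30;      /* 1 GiB sent by its end    */

        /* ~586 MiB dirtied vs. 512 MiB (half of what was transferred):
         * dirtying outpaces the link, so dirty_rate_high_cnt would bump. */
        if (num_dirty_pages_period * page_size >
            (bytes_xfer_now - bytes_xfer_prev) / 2) {
            printf("dirty rate too high: count toward throttling\n");
        }
        return 0;
    }

Once enough consecutive periods look like this (the patch requires the
counter to pass five), mig_throttle_on flips and every VCPU starts
paying the delay.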
> diff --git a/cpus.c b/cpus.c
> index 5a98a37..eea6601 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -616,6 +616,7 @@ static void qemu_tcg_init_cpu_signals(void)
>  #endif /* _WIN32 */
>
>  static QemuMutex qemu_global_mutex;
> +static QemuMutex qemu_mig_throttle_mutex;
>  static QemuCond qemu_io_proceeded_cond;
>  static bool iothread_requesting_mutex;
>
> @@ -638,6 +639,7 @@ void qemu_init_cpu_loop(void)
>      qemu_cond_init(&qemu_work_cond);
>      qemu_cond_init(&qemu_io_proceeded_cond);
>      qemu_mutex_init(&qemu_global_mutex);
> +    qemu_mutex_init(&qemu_mig_throttle_mutex);
>
>      qemu_thread_get_self(&io_thread);
>  }
> @@ -923,6 +925,16 @@ static bool qemu_in_vcpu_thread(void)
>      return cpu_single_env && qemu_cpu_is_self(ENV_GET_CPU(cpu_single_env));
>  }
>
> +void qemu_mutex_lock_mig_throttle(void)
> +{
> +    qemu_mutex_lock(&qemu_mig_throttle_mutex);
> +}
> +
> +void qemu_mutex_unlock_mig_throttle(void)
> +{
> +    qemu_mutex_unlock(&qemu_mig_throttle_mutex);
> +}
> +
>  void qemu_mutex_lock_iothread(void)
>  {
>      if (!tcg_enabled()) {
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e2acec6..cccee91 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -92,6 +92,15 @@ uint64_t ram_bytes_remaining(void);
>  uint64_t ram_bytes_transferred(void);
>  uint64_t ram_bytes_total(void);
>
> +#ifndef _QEMU_MIG_THROTTLE
> +#define _QEMU_MIG_THROTTLE
> +
> +bool throttling_needed(void);
> +bool throttling_now(void);
> +void *migration_throttle_down(void *);
> +
> +#endif
> +
>  extern SaveVMHandlers savevm_ram_handlers;
>
>  uint64_t dup_mig_bytes_transferred(void);
> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
> index 6f0200a..9a3886d 100644
> --- a/include/qemu/main-loop.h
> +++ b/include/qemu/main-loop.h
> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>   */
>  void qemu_mutex_unlock_iothread(void);
>
> +void qemu_mutex_lock_mig_throttle(void);
> +void qemu_mutex_unlock_mig_throttle(void);
> +
>  /* internal interfaces */
>
>  void qemu_fd_register(int fd);
> diff --git a/kvm-all.c b/kvm-all.c
> index 2d92721..95010ce 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -33,6 +33,8 @@
>  #include "exec/memory.h"
>  #include "exec/address-spaces.h"
>  #include "qemu/event_notifier.h"
> +#include "sysemu/cpus.h"
> +#include "migration/migration.h"
>
>  /* This check must be after config-host.h is included */
>  #ifdef CONFIG_EVENTFD
> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>      KVM_CAP_LAST_INFO
>  };
>
> +static void mig_delay_vcpu(void);
> +
>  static KVMSlot *kvm_alloc_slot(KVMState *s)
>  {
>      int i;
> @@ -1609,6 +1613,10 @@ int kvm_cpu_exec(CPUArchState *env)
>          }
>          qemu_mutex_unlock_iothread();
>
> +        if (throttling_needed()) {
> +            mig_delay_vcpu();
> +        }
> +
>          run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
>
>          qemu_mutex_lock_iothread();
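
One detail worth calling out about the kvm_cpu_exec() hook above (my
reading of the code, not something the patch states): the nap is taken
after qemu_mutex_unlock_iothread(), so a throttled VCPU sleeps without
holding the global lock. Roughly, each iteration of the VCPU loop
becomes:

    /* Annotated paraphrase of the kvm_cpu_exec() loop with the patch
     * applied -- not a literal excerpt. */
    do {
        /* ... process pending work under the iothread lock ... */
        qemu_mutex_unlock_iothread();      /* drop the global lock first   */

        if (throttling_needed()) {         /* flag set during migration    */
            mig_delay_vcpu();              /* 50 ms nap stalls only this   */
        }                                  /* VCPU; I/O threads keep going */

        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0); /* re-enter the guest   */

        qemu_mutex_lock_iothread();
        /* ... handle the exit reason ... */
    } while (run_ret >= 0);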
> @@ -2032,3 +2040,44 @@ int kvm_on_sigbus(int code, void *addr)
>  {
>      return kvm_arch_on_sigbus(code, addr);
>  }
> +
> +static bool throttling;
> +bool throttling_now(void)
> +{
> +    if (throttling) {
> +        return true;
> +    }
> +    return false;
> +}
> +
> +static void mig_delay_vcpu(void)
> +{
> +    g_usleep(50*1000);
> +}
> +
> +/* Stub used for getting the vcpu out of the VM and into qemu via
> +   run_on_cpu() */
> +static void mig_kick_cpu(void *opq)
> +{
> +    return;
> +}
> +
> +/* To reduce the dirty rate, explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catch up.
> +   The workload will experience a greater performance drop, but for a
> +   shorter duration.
> +*/
> +void *migration_throttle_down(void *opaque)
> +{
> +    throttling = true;
> +    while (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +        g_usleep(25*1000);
> +    }
> +    throttling = false;
> +    return NULL;
> +}
> diff --git a/migration.c b/migration.c
> index 3eb0fad..a464afc 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -24,6 +24,7 @@
>  #include "qemu/thread.h"
>  #include "qmp-commands.h"
>  #include "trace.h"
> +#include "sysemu/cpus.h"
>
>  //#define DEBUG_MIGRATION
>
> @@ -503,6 +504,7 @@ static void *migration_thread(void *opaque)
>      int64_t max_size = 0;
>      int64_t start_time = initial_time;
>      bool old_vm_running = false;
> +    QemuThread thread;
>
>      DPRINTF("beginning savevm\n");
>      qemu_savevm_state_begin(s->file, &s->params);
> @@ -517,6 +519,10 @@ static void *migration_thread(void *opaque)
>              DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>              if (pending_size && pending_size >= max_size) {
>                  qemu_savevm_state_iterate(s->file);
> +                if (throttling_needed() && !throttling_now()) {
> +                    qemu_thread_create(&thread, migration_throttle_down,
> +                                       NULL, QEMU_THREAD_DETACHED);
> +                }
>              } else {
>                  DPRINTF("done iterating\n");
>                  qemu_mutex_lock_iothread();
> --
> 1.7.1
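
A back-of-envelope bound on how hard this throttles (my arithmetic,
under the simplifying assumption that each VCPU is kicked out of guest
mode on every 25 ms pass of migration_throttle_down() and then serves
the full 50 ms nap in mig_delay_vcpu() before re-entering KVM_RUN):

    guest time per cycle  <= 25 ms              (the kicker's period)
    nap per re-entry       = 50 ms              (mig_delay_vcpu)
    max guest duty cycle  <= 25 / (25 + 50) = 1/3

In other words, a throttled VCPU gets at most roughly a third of its
unthrottled guest time, which should cut the dirty rate by a similar
factor and give the migration thread the headroom to catch up that the
patch description promises.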