From: Juan Quintela <quintela@redhat.com>
To: Chegu Vinod <chegu_vinod@hp.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org,
anthony@codemonkey.ws, owasserm@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
Date: Tue, 30 Apr 2013 17:20:47 +0200 [thread overview]
Message-ID: <877gjkm4ds.fsf@elfo.elfo> (raw)
In-Reply-To: <1367095836-19318-1-git-send-email-chegu_vinod@hp.com> (Chegu Vinod's message of "Sat, 27 Apr 2013 13:50:36 -0700")
Chegu Vinod <chegu_vinod@hp.com> wrote:
> Busy enterprise workloads hosted on large-sized VMs tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (and using dedicated 10Gig NICs
> between hosts) the live migration does NOT converge.
>
> A few options that were discussed or are being pursued to help with
> the convergence issue include:
>
> 1) Slow down the guest considerably via cgroups CPU controls - requires
> libvirt client support to detect and trigger the action, but conceptually
> similar to this RFC change.
>
> 2) Speed up transfer rate:
> - RDMA-based pre-copy - lower overhead and fast (unfortunately it
> has a few restrictions, and some customers still choose not
> to deploy RDMA :-( ).
> - Add parallelism to improve the transfer rate and use multiple 10Gig
> connections (bonded) - could add some overhead on the host.
>
> 3) Post-copy (preferably with RDMA) or a pre+post-copy hybrid - sounds
> promising, but we need to consider and handle new failure scenarios.
>
> If an enterprise user chooses to force convergence of their migration
> via the new capability "auto-converge", then with this change we auto-detect
> the lack-of-convergence scenario and trigger a slowdown of the workload
> by explicitly disallowing the VCPUs from spending much time in the VM
> context.
>
> The migration thread tries to catch up, and this eventually leads
> to convergence in some "deterministic" amount of time. Yes, it does
> impact the performance of all the VCPUs, but in my observation that
> lasts only for a short duration; i.e. we end up entering
> stage 3 (the downtime phase) soon after that.
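For context, the detection side of this, as I read the patch: each
bitmap sync compares the pages dirtied during the period against the
bytes actually transferred, and the throttling thread is only started
once the dirty rate has outpaced the link for several consecutive
periods. A rough sketch (the /2 factor and the threshold of 4 are
illustrative here, not necessarily the patch's exact values):

    bytes_xfer_now = ram_bytes_transferred();
    /* dirtying faster than we can transfer half of it? */
    if ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
            ((bytes_xfer_now - bytes_xfer_prev) / 2)) {
        if (dirty_rate_high_cnt++ > 4) {
            dirty_rate_high_cnt = 0;
            /* spawn the throttle-down thread */
        }
    }
    bytes_xfer_prev = bytes_xfer_now;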
>
> No external trigger is required (unlike option 1), and it can co-exist
> with the enhancements being pursued as part of option 2 (e.g. RDMA).
>
> Thanks to Juan and Paolo for their useful suggestions.
>
> Verified the convergence using the following:
> - SpecJbb2005 workload running on a 20-VCPU/256G guest (~80% busy)
> - OLTP-like workload running on an 80-VCPU/512G guest (~80% busy)
>
> Sample results with the SpecJbb2005 workload (migrate speed set to 20Gb
> and migrate downtime set to 4 seconds):
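For reference, assuming the usual HMP commands, the setup described
above would be something like:

    (qemu) migrate_set_capability auto-converge on
    (qemu) migrate_set_speed 20G
    (qemu) migrate_set_downtime 4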
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: off <----
> Migration status: active
> total time: 1487503 milliseconds
~1487 seconds, and still not converged
> expected downtime: 519 milliseconds
> transferred ram: 383749347 kbytes
> remaining ram: 2753372 kbytes
> total ram: 268444224 kbytes
> duplicate: 65461532 pages
> skipped: 64901568 pages
> normal: 95750218 pages
> normal bytes: 383000872 kbytes
> dirty pages rate: 67551 pages
>
> ---
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: on <----
> Migration status: completed
> total time: 241161 milliseconds
> downtime: 6373 milliseconds
6.3 seconds and finished, not bad at all O:-)
How much does the guest throughput drop while we are in auto-converge mode?
> transferred ram: 28235307 kbytes
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64946416 pages
> skipped: 64903523 pages
> normal: 7044971 pages
> normal bytes: 28179884 kbytes
>
> Changes from v1:
> - rebased to latest qemu.git
> - added auto-converge capability (default off) - suggested by Anthony Liguori &
> Eric Blake.
>
> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
> MigrationState *s = migrate_get_current();
> static int64_t start_time;
> static int64_t num_dirty_pages_period;
> + static int64_t bytes_xfer_prev;
> int64_t end_time;
> + int64_t bytes_xfer_now;
> + static int dirty_rate_high_cnt;
> +
> + if (migrate_auto_converge() && !bytes_xfer_prev) {
Just do the !bytes_xfer_prev test here? migrate_auto_converge() is more
expensive to call than just doing the assignment.
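I.e. something like this (a sketch, assuming the body of the if just
initializes bytes_xfer_prev from the current transfer counter):

    /* cheap test first; no need to query the capability just to
     * initialize the counter */
    if (!bytes_xfer_prev) {
        bytes_xfer_prev = ram_bytes_transferred();
    }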
> +
> + if (value) {
> + return true;
> + }
> + return false;
this code is just:
return value;
> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
> index 6f0200a..9a3886d 100644
> --- a/include/qemu/main-loop.h
> +++ b/include/qemu/main-loop.h
> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
> */
> void qemu_mutex_unlock_iothread(void);
>
> +void qemu_mutex_lock_mig_throttle(void);
> +void qemu_mutex_unlock_mig_throttle(void);
> +
> /* internal interfaces */
>
> void qemu_fd_register(int fd);
> diff --git a/kvm-all.c b/kvm-all.c
> index 2d92721..a92cb77 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -33,6 +33,8 @@
> #include "exec/memory.h"
> #include "exec/address-spaces.h"
> #include "qemu/event_notifier.h"
> +#include "sysemu/cpus.h"
> +#include "migration/migration.h"
>
> /* This check must be after config-host.h is included */
> #ifdef CONFIG_EVENTFD
> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
> KVM_CAP_LAST_INFO
> };
>
> +static void mig_delay_vcpu(void);
> +
Move the function definition here?
> +
> +static bool throttling;
> +bool throttling_now(void)
> +{
> + if (throttling) {
> + return true;
> + }
> + return false;
return throttling;
> +/* Stub used for getting the vcpu out of VM and into qemu via
> + run_on_cpu()*/
> +static void mig_kick_cpu(void *opq)
> +{
> + return;
> +}
> +
> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catch up.
> + Workload will experience a greater performance drop but for a shorter
> + duration.
> +*/
> +void *migration_throttle_down(void *opaque)
> +{
> + throttling = true;
> + while (throttling_needed()) {
> + CPUArchState *penv = first_cpu;
I am not sure that we can safely walk the CPU list without holding the
iothread lock here.
> + while (penv) {
> + qemu_mutex_lock_iothread();
> + run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> + qemu_mutex_unlock_iothread();
> + penv = penv->next_cpu;
> + }
> + g_usleep(25*1000);
> + }
> + throttling = false;
> + return NULL;
> +}
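Something along these lines would be safer (a sketch: hold the lock
across the whole walk; if I read cpus.c right, run_on_cpu() drops the
iothread lock internally while waiting for the work item, so holding it
here does not deadlock):

    CPUArchState *penv;

    qemu_mutex_lock_iothread();
    for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
        /* force each vcpu out of guest mode */
        run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
    }
    qemu_mutex_unlock_iothread();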