From: Chegu Vinod
Date: Tue, 30 Apr 2013 08:55:16 -0700
Message-ID: <517FE964.2050702@hp.com>
In-Reply-To: <877gjkm4ds.fsf@elfo.elfo>
Subject: Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
To: quintela@redhat.com
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, anthony@codemonkey.ws, owasserm@redhat.com

On 4/30/2013 8:20 AM, Juan Quintela wrote:
> Chegu Vinod wrote:
>> Busy enterprise workloads hosted on large-sized VMs tend to dirty
>> memory faster than the transfer rate achieved via live guest migration.
>> Despite some good recent improvements (and using dedicated 10Gig NICs
>> between hosts) the live migration does NOT converge.
>>
>> A few options that were discussed or are being pursued to help with
>> the convergence issue include:
>>
>> 1) Slow down the guest considerably via cgroup's CPU controls - requires
>>    libvirt client support to detect and trigger the action, but is
>>    conceptually similar to this RFC change.
>>
>> 2) Speed up the transfer rate:
>>    - RDMA-based pre-copy - lower overhead and fast (unfortunately it
>>      has a few restrictions, and some customers still choose not
>>      to deploy RDMA :-( ).
>>    - Add parallelism to improve the transfer rate and use multiple 10Gig
>>      connections (bonded) - could add some overhead on the host.
>>
>> 3) Post-copy (preferably with RDMA) or a pre+post-copy hybrid - sounds
>>    promising, but we need to consider and handle newer failure scenarios.
>>
>> If an enterprise user chooses to force convergence of their migration
>> via the new capability "auto-converge", then with this change we
>> auto-detect the lack-of-convergence scenario and trigger a slowdown of
>> the workload by explicitly disallowing the VCPUs from spending much
>> time in the VM context.
>>
>> The migration thread tries to catch up, and this eventually leads
>> to convergence in some "deterministic" amount of time. Yes, it does
>> impact the performance of all the VCPUs, but in my observation that
>> lasts only for a short duration of time, i.e. we end up entering
>> stage 3 (the downtime phase) soon after that.
>>
>> No external trigger is required (unlike option 1) and it can co-exist
>> with the enhancements being pursued as part of option 2 (e.g. RDMA).
>>
>> Thanks to Juan and Paolo for their useful suggestions.
>>
>> Verified the convergence using the following:
>> - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy)
>> - OLTP-like workload running on an 80VCPU/512G guest (~80% busy)
>>
>> Sample results with the SpecJbb2005 workload (migrate speed set to 20Gb
>> and migrate downtime set to 4 seconds):
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: off <----
>> Migration status: active
>> total time: 1487503 milliseconds
> 1487 seconds and still the migration is not completed.
>
>> expected downtime: 519 milliseconds
>> transferred ram: 383749347 kbytes
>> remaining ram: 2753372 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 65461532 pages
>> skipped: 64901568 pages
>> normal: 95750218 pages
>> normal bytes: 383000872 kbytes
>> dirty pages rate: 67551 pages
>>
>> ---
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: on <----
>> Migration status: completed
>> total time: 241161 milliseconds
>> downtime: 6373 milliseconds
> 6.3 seconds and finished, not bad at all O:-)

That's the *downtime*. The total time for the migration to complete is
241 seconds. (SpecJBB is one of those workloads that dirties memory
quite a bit.)

> How much does the guest throughput drop while we enter auto-converge
> mode?

Workload performance drops for a short duration, but the migration soon
switches to stage 3.

>
>> transferred ram: 28235307 kbytes
>> remaining ram: 0 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 64946416 pages
>> skipped: 64903523 pages
>> normal: 7044971 pages
>> normal bytes: 28179884 kbytes
>>
>> Changes from v1:
>> - rebased to latest qemu.git
>> - added auto-converge capability (default off) - suggested by Anthony
>>   Liguori and Eric Blake.
>>
>> Signed-off-by: Chegu Vinod
>> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
>>      MigrationState *s = migrate_get_current();
>>      static int64_t start_time;
>>      static int64_t num_dirty_pages_period;
>> +    static int64_t bytes_xfer_prev;
>>      int64_t end_time;
>> +    int64_t bytes_xfer_now;
>> +    static int dirty_rate_high_cnt;
>> +
>> +    if (migrate_auto_converge() && !bytes_xfer_prev) {
> Just do the !bytes_xfer_prev test here? migrate_auto_converge() is more
> expensive to call than just doing the assignment.
Sure.

>
>> +
>> +    if (value) {
>> +        return true;
>> +    }
>> +    return false;
> This code is just:
>
>     return value;

OK.

>
>> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
>> index 6f0200a..9a3886d 100644
>> --- a/include/qemu/main-loop.h
>> +++ b/include/qemu/main-loop.h
>> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>>   */
>>  void qemu_mutex_unlock_iothread(void);
>>
>> +void qemu_mutex_lock_mig_throttle(void);
>> +void qemu_mutex_unlock_mig_throttle(void);
>> +
>>  /* internal interfaces */
>>
>>  void qemu_fd_register(int fd);
>> diff --git a/kvm-all.c b/kvm-all.c
>> index 2d92721..a92cb77 100644
>> --- a/kvm-all.c
>> +++ b/kvm-all.c
>> @@ -33,6 +33,8 @@
>>  #include "exec/memory.h"
>>  #include "exec/address-spaces.h"
>>  #include "qemu/event_notifier.h"
>> +#include "sysemu/cpus.h"
>> +#include "migration/migration.h"
>>
>>  /* This check must be after config-host.h is included */
>>  #ifdef CONFIG_EVENTFD
>> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>>      KVM_CAP_LAST_INFO
>>  };
>>
>> +static void mig_delay_vcpu(void);
>> +
> Move the function definition to here?

OK.

>> +
>> +static bool throttling;
>> +bool throttling_now(void)
>> +{
>> +    if (throttling) {
>> +        return true;
>> +    }
>> +    return false;
> return throttling;
>
>> +/* Stub used for getting the vcpu out of the VM and into qemu via
>> +   run_on_cpu() */
>> +static void mig_kick_cpu(void *opq)
>> +{
>> +    return;
>> +}
>> +
>> +/* To reduce the dirty rate, explicitly disallow the VCPUs from spending
>> +   much time in the VM. The migration thread will try to catch up.
>> +   The workload will experience a greater performance drop, but for a
>> +   shorter duration.
>> +*/
>> +void *migration_throttle_down(void *opaque)
>> +{
>> +    throttling = true;
>> +    while (throttling_needed()) {
>> +        CPUArchState *penv = first_cpu;
> I am not sure that we can follow the list without the iothread lock
> here.

Hmm..
Is this due to vcpu hot-plug that might happen at the time of live
migration, or due to something else? I was trying to avoid holding the
iothread lock for a longer duration and slowing down the migration
thread.

>
>> +        while (penv) {
>> +            qemu_mutex_lock_iothread();
>> +            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>> +            qemu_mutex_unlock_iothread();
>> +            penv = penv->next_cpu;
>> +        }
>> +        g_usleep(25 * 1000);
>> +    }
>> +    throttling = false;
>> +    return NULL;
>> +}
> .

Thanks,
Vinod