From: Chegu Vinod
Date: Sun, 23 Jun 2013 19:08:39 -0700
Message-ID: <51C7AA27.5080306@hp.com>
References: <1371218337-51891-1-git-send-email-chegu_vinod@hp.com> <1371218337-51891-4-git-send-email-chegu_vinod@hp.com> <51C2FB6B.7090401@redhat.com>
In-Reply-To: <51C2FB6B.7090401@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH v6 3/3] Force auto-convergence of live migration
To: Paolo Bonzini
Cc: owasserm@redhat.com, qemu-devel@nongnu.org, anthony@codemonkey.ws, quintela@redhat.com

On 6/20/2013 5:54 AM, Paolo Bonzini wrote:
> On 14/06/2013 15:58, Chegu Vinod wrote:
>> If a user chooses to turn on the auto-converge migration capability
>> these changes detect the lack of convergence and throttle down the
>> guest. i.e. force the VCPUs out of the guest for some duration
>> and let the migration thread catchup and help converge.
> Hi Vinod,
>
> pretty much the same comments I sent you yesterday on the obsolete
> version of the patch still apply.
>
>> Verified the convergence using the following :
>> - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
>> - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>
>> Sample results with Java warehouse workload : (migrate speed set to 20Gb and
>> migrate downtime set to 4seconds).
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: off <----
>> Migration status: active
>> total time: 1487503 milliseconds
>> expected downtime: 519 milliseconds
>> transferred ram: 383749347 kbytes
>> remaining ram: 2753372 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 65461532 pages
>> skipped: 64901568 pages
>> normal: 95750218 pages
>> normal bytes: 383000872 kbytes
>> dirty pages rate: 67551 pages
>>
>> ---
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: on <----
>> Migration status: completed
>> total time: 241161 milliseconds
>> downtime: 6373 milliseconds
>> transferred ram: 28235307 kbytes
>> remaining ram: 0 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 64946416 pages
>> skipped: 64903523 pages
>> normal: 7044971 pages
>> normal bytes: 28179884 kbytes
>>
>> Signed-off-by: Chegu Vinod
>> ---
>>  arch_init.c |   85 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 85 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 5d32ecf..69c6c8c 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -104,6 +104,8 @@ int graphic_depth = 15;
>>  #endif
>>
>>  const uint32_t arch_type = QEMU_ARCH;
>> +static bool mig_throttle_on;
>> +static void throttle_down_guest_to_converge(void);
>>
>>  /***********************************************************/
>>  /* ram save/restore */
>> @@ -378,8 +380,15 @@ static void migration_bitmap_sync(void)
>>      uint64_t num_dirty_pages_init = migration_dirty_pages;
>>      MigrationState *s = migrate_get_current();
>>      static int64_t start_time;
>> +    static int64_t bytes_xfer_prev;
>>      static int64_t num_dirty_pages_period;
>>      int64_t end_time;
>> +    int64_t bytes_xfer_now;
>> +    static int dirty_rate_high_cnt;
>> +
>> +    if (!bytes_xfer_prev) {
>> +        bytes_xfer_prev = ram_bytes_transferred();
>> +    }
>>
>>      if (!start_time) {
>>          start_time = qemu_get_clock_ms(rt_clock);
>> @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
>>
>>      /* more than 1 second = 1000 millisecons */
>>      if (end_time > start_time + 1000) {
>> +        if (migrate_auto_converge()) {
>> +            /* The following detection logic can be refined later. For now:
>> +               Check to see if the dirtied bytes is 50% more than the approx.
>> +               amount of bytes that just got transferred since the last time we
>> +               were in this routine. If that happens >N times (for now N==4)
>> +               we turn on the throttle down logic */
>> +            bytes_xfer_now = ram_bytes_transferred();
>> +            if (s->dirty_pages_rate &&
>> +                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
>> +                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
>> +                if (dirty_rate_high_cnt++ > 4) {
> Too many parentheses, and please remove the nested if.
>
>> +                    DPRINTF("Unable to converge. Throtting down guest\n");
> Please use tracepoint instead.
>
>> +                    mig_throttle_on = true;
> Need to reset dirty_rate_high_cnt here, and both
> dirty_rate_high_cnt/mig_throttle_on if you see !migrate_auto_converge().
> This ensures that throttling does not kick in automatically if you
> disable and re-enable the feature. It also lets you remove a bunch of
> migrate_auto_converge() checks.
>
> You also need to reset dirty_rate_high_cnt/mig_throttle_on in the setup
> phase of migration.
>
>> +                }
>> +            }
>> +            bytes_xfer_prev = bytes_xfer_now;
>> +        }
>>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>>              / (end_time - start_time);
>>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
>> @@ -628,6 +654,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>          }
>>          total_sent += bytes_sent;
>>          acct_info.iterations++;
>> +        throttle_down_guest_to_converge();
> You can use a shorter name, like check_cpu_throttling().
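Paolo's suggestions above (drop the nested if, reset dirty_rate_high_cnt when throttling triggers, and clear both dirty_rate_high_cnt and mig_throttle_on when auto-converge is disabled) can be sketched as a standalone function. This is an illustrative model, not QEMU code: the fixed 4096-byte page size, the function name check_convergence, and the harness values are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TARGET_PAGE_SIZE 4096  /* illustrative; target-dependent in QEMU */

static bool mig_throttle_on;
static int dirty_rate_high_cnt;

/* Model of the per-sync-period detection step: if the bytes dirtied in
 * the last period exceed half the bytes transferred in that period for
 * more than 4 consecutive checks, turn throttling on and reset the
 * counter.  Disabling auto-converge clears all state, so re-enabling
 * the capability starts from a clean slate, as Paolo suggests. */
static void check_convergence(bool auto_converge,
                              uint64_t dirty_pages_period,
                              uint64_t bytes_xfer_now,
                              uint64_t *bytes_xfer_prev)
{
    if (!auto_converge) {
        dirty_rate_high_cnt = 0;
        mig_throttle_on = false;
        return;
    }
    uint64_t transferred = bytes_xfer_now - *bytes_xfer_prev;
    if (dirty_pages_period * TARGET_PAGE_SIZE > transferred / 2 &&
        ++dirty_rate_high_cnt > 4) {
        dirty_rate_high_cnt = 0;
        mig_throttle_on = true;
    }
    *bytes_xfer_prev = bytes_xfer_now;
}
```

With 1000 pages dirtied per period against only ~1 MB transferred, the condition holds every period, so the fifth consecutive check flips mig_throttle_on.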
>
>>      /* we want to check in the 1st loop, just in case it was the 1st time
>>         and we had to sync the dirty bitmap.
>>         qemu_get_clock_ns() is a bit expensive, so we only check each some
>> @@ -1098,3 +1125,61 @@ TargetInfo *qmp_query_target(Error **errp)
>>
>>      return info;
>>  }
>> +
>> +static bool throttling_needed(void)
>> +{
>> +    if (!migrate_auto_converge()) {
>> +        return false;
>> +    }
>> +    return mig_throttle_on;
>> +}
>> +
>> +/* Stub function that's gets run on the vcpu when its brought out of the
>> +   VM to run inside qemu via async_run_on_cpu() */
>> +static void mig_sleep_cpu(void *opq)
>> +{
>> +    qemu_mutex_unlock_iothread();
>> +    g_usleep(30*1000);
>> +    qemu_mutex_lock_iothread();
> Letting the user specify the entity of the pause would be nice, so that
> management can ramp it up. It can be done as a follow-up by adding a
> '*value': 'int' field to MigrationCapabilityStatus (between 0 and 100,
> default 30 as above).

Thanks Paolo. With the exception of the above, which can be pursued as a
follow-up, I have incorporated all your suggested changes, re-tested, and
am sending out a v7 series (without RFC).

Thanks!
Vinod

> Paolo
>
>> +}
>> +
>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>> +   much time in the VM. The migration thread will try to catchup.
>> +   Workload will experience a performance drop.
>> +*/
>> +static void mig_throttle_cpu_down(CPUState *cpu, void *data)
>> +{
>> +    async_run_on_cpu(cpu, mig_sleep_cpu, NULL);
>> +}
>> +
>> +static void mig_throttle_guest_down(void)
>> +{
>> +    if (throttling_needed()) {
> No need for this "if", it is done already in the caller.
>
>> +        qemu_mutex_lock_iothread();
>> +        qemu_for_each_cpu(mig_throttle_cpu_down, NULL);
>> +        qemu_mutex_unlock_iothread();
>> +    }
>> +}
>> +
>> +static void throttle_down_guest_to_converge(void)
>> +{
>> +    static int64_t t0;
>> +    int64_t t1;
>> +
>> +    if (!throttling_needed()) {
> With the above suggested changes, this can simply check mig_throttle_on.
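Paolo's follow-up idea — a '*value': 'int' field (0 to 100, default 30) so management can ramp the pause up — could map the value to the per-interval sleep. The sketch below is purely hypothetical: the patch hard-codes g_usleep(30*1000), and treating the value as milliseconds of sleep per throttling interval is just one plausible reading, not anything the thread settled on.

```c
#include <stdint.h>

/* Hypothetical mapping from a 0-100 throttle value to the vCPU sleep
 * time per throttling interval, clamped to the valid range.
 * value == 30 reproduces the patch's hard-coded g_usleep(30*1000). */
static uint64_t throttle_sleep_us(int value)
{
    if (value < 0) {
        value = 0;
    } else if (value > 100) {
        value = 100;
    }
    return (uint64_t)value * 1000;  /* milliseconds -> microseconds */
}
```

A larger value would make each forced vCPU pause longer, raising migration throughput at the cost of guest performance.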
>
>> +        return;
>> +    }
>> +
>> +    if (!t0) {
>> +        t0 = qemu_get_clock_ns(rt_clock);
>> +        return;
>> +    }
>> +
>> +    t1 = qemu_get_clock_ns(rt_clock);
>> +
>> +    /* If it has been more than 40 ms since the last time the guest
>> +     * was throtled then do it again.
>> +     */
> throttled
>
>> +    if (((t1-t0)/1000000) > 40) {
> I prefer moving the multiplication to the right so you don't need
> parentheses, but this is _really_ a nit...
>
> Paolo
>
>> +        mig_throttle_guest_down();
>> +        t0 = t1;
>> +    }
>> +}
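The 40 ms pacing logic above can be isolated into a small helper for clarity. This is a sketch with made-up names (throttle_due, last_ns) that takes plain nanosecond timestamps instead of calling qemu_get_clock_ns(), and writes the comparison with the multiplication on the right as Paolo prefers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the pacing check in throttle_down_guest_to_converge():
 * throttle at most once per 40 ms.  The first call only records a
 * timestamp; later calls fire once more than 40 ms have elapsed,
 * rearming the timer each time they do. */
static bool throttle_due(int64_t now_ns, int64_t *last_ns)
{
    if (*last_ns == 0) {
        *last_ns = now_ns;      /* first call: arm the timer only */
        return false;
    }
    if (now_ns - *last_ns > 40 * 1000000LL) {
        *last_ns = now_ns;      /* fired: restart the 40 ms window */
        return true;
    }
    return false;
}
```

The caller would invoke mig_throttle_guest_down() whenever this returns true, keeping the clock handling out of the throttling path itself.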