Subject: Re: [Qemu-devel] [PATCH v1] migration: skip sending ram pages released by virtio-balloon driver.
From: Jitendra Kolhe
To: "Li, Liang Z", Roman Kagan, qemu-devel@nongnu.org, dgilbert@redhat.com, simhan@hpe.com, mohan_parthasarathy@hpe.com
Date: Fri, 11 Mar 2016 20:09:09 +0530
Message-ID: <56E2D88D.2060702@hpe.com>
References: <1457082167-12254-1-git-send-email-jitendra.kolhe@hpe.com> <20160310094912.GC9715@rkaganb.sw.ru> <56E25EBF.6050109@hpe.com> <56E29BD9.8010306@hpe.com>

On 3/11/2016 4:24 PM, Li, Liang Z wrote:
>>>>> I wonder if it is the scanning for zeros or sending the whiteout
>>>>> which affects the total migration time more. If it is the former
>>>>> (as I would expect) then a rather local change to is_zero_range()
>>>>> to make use of the mapping information before scanning would get
>>>>> you all the speedups without protocol changes, interfering with
>>>>> postcopy etc.
>>>>>
>>>>> Roman.
>>>>>
>>>>
>>>> Localizing the solution to the zero page scan check is a good idea.
>>>> I too agree that most of the time is spent in scanning for zero
>>>> pages, in which case we should be able to localize the solution to
>>>> is_zero_range(). However, in the case of ballooned-out pages (which
>>>> can be seen as a subset of guest zero pages) we also spend a very
>>>> small portion of the total migration time in sending the control
>>>> information, which can also be avoided.
>>>> From my tests for a 16GB idle guest of which 12GB was ballooned
>>>> out, the zero page scan time for the 12GB of ballooned-out pages
>>>> was ~1789 ms, and save_page_header + qemu_put_byte(f, 0); for the
>>>> same 12GB of ballooned-out pages was ~556 ms. Total migration time
>>>> was ~8000 ms.
>>>
>>> How did you do the tests? ~556 ms seems too long for putting several
>>> bytes into the buffer.
>>> It's likely the time you measured contains the portion spent
>>> processing the other 4GB of guest memory pages.
>>>
>>> Liang
>>>
>>
>> I modified save_zero_page() as below and updated the timers only for
>> ballooned-out pages, so is_zero_range() should return true (and
>> qemu_balloon_bitmap_test() from my patch set returned 1). With the
>> instrumentation below, I got t1 = ~1789 ms and t2 = ~556 ms. Also, the
>> total migration time noted (~8000 ms) is for the unmodified qemu
>> source.
>
> You mean the total live migration time for the unmodified qemu and the
> 'you modified for test' qemu are almost the same?
>

Not sure I understand the question, but if 'you modified for test' means
the modifications to save_zero_page() shown below, then the answer is no.
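(As an aside, the "localized" change Roman suggested - short-circuiting
the zero page scan with the balloon bitmap - would look roughly like the
sketch below. This is only an illustration, not part of the posted patch
set; it assumes the qemu_balloon_bitmap_test(block, offset) helper from
the proposed patch set and that ballooned-out pages read back as zero.)

/* Sketch only: bypass the byte-wise zero scan for pages the balloon
 * bitmap already marks as released by the guest. */
static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                          uint8_t *p, uint64_t *bytes_transferred)
{
    int pages = -1;

    /* Ballooned-out pages are expected to read as zero, so the
     * is_zero_range() scan can be skipped for them. */
    if (qemu_balloon_bitmap_test(block, offset) == 1 ||
        is_zero_range(p, TARGET_PAGE_SIZE)) {
        acct_info.dup_pages++;
        *bytes_transferred += save_page_header(f, block,
                                               offset | RAM_SAVE_FLAG_COMPRESS);
        qemu_put_byte(f, 0);
        *bytes_transferred += 1;
        pages = 1;
    }
    return pages;
}

The gain from such a localized variant would be limited to the zero scan
itself; the per-page control information (save_page_header +
qemu_put_byte) would still be sent for ballooned-out pages.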
Here is what I tried. Let's say we have 3 versions of qemu (the timings
below are for a 16GB idle guest with 12GB ballooned out):

v1. Unmodified qemu - absolutely no code change - total migration time
= ~7600 ms (I rounded this one to ~8000 ms).

v2. Modified qemu 1 - with the proposed patch set (which skips both the
zero page scan and migrating the control information for ballooned-out
pages) - total migration time = ~5700 ms.

v3. Modified qemu 2 - only with the changes to save_zero_page() as
discussed in the previous mail (and of course using the proposed patch
set only to maintain the bitmap for ballooned-out pages) - total
migration time is irrelevant in this case.
Total zero page scan time = ~1789 ms.
Total (save_page_header + qemu_put_byte(f, 0)) = ~556 ms.

Everything seems to add up here (may not be exact) - 5700+1789+559 =
~8000 ms. I see 2 factors that we have not considered in this addition:
a. the overhead of migrating the balloon bitmap to the target, and
b. as you mentioned below, the overhead of qemu_clock_get_ns().

>> It seems to add up to the final migration time with the proposed
>> patch set.
>>
>> Here is the last entry for "another round" of the test; this time it's
>> ~547 ms:
>> JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 us,
>> save_page_header_time=184 us, total_save_zero_page_time=1453 us
>> cumulated vals: zero_page_scan_time=1723920378 us,
>> save_page_header_time=547514618 us,
>> total_save_zero_page_time=2371059239 us
>>
>> static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>>                           uint8_t *p, uint64_t *bytes_transferred)
>> {
>>     int pages = -1;
>>     int64_t time1, time2, time3, time4;
>>     static int64_t t1 = 0, t2 = 0, t3 = 0;
>>
>>     time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>     if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>         time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>         acct_info.dup_pages++;
>>         *bytes_transferred += save_page_header(f, block,
>>                                                offset | RAM_SAVE_FLAG_COMPRESS);
>>         qemu_put_byte(f, 0);
>>         time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>         *bytes_transferred += 1;
>>         pages = 1;
>>     }
>>     time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>
>>     if (qemu_balloon_bitmap_test(block, offset) == 1) {
>>         t1 += (time2 - time1);
>>         t2 += (time3 - time2);
>>         t3 += (time4 - time1);
>>         fprintf(stderr, "block=%lx, offset=%lx, zero_page_scan_time=%ld us, "
>>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n"
>>                 "cumulated vals: zero_page_scan_time=%ld us, "
>>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n",
>>                 (unsigned long)block, (unsigned long)offset,
>>                 (time2 - time1), (time3 - time2), (time4 - time1), t1, t2, t3);
>>     }
>>     return pages;
>> }
>>
>
> Thanks for your description.
> The issue here is that there are too many qemu_clock_get_ns() calls; the
> cost of the function itself may become the main time-consuming
> operation. You can measure the time consumed by the qemu_clock_get_ns()
> calls you added for the test by comparing the result with a version
> which does not add the qemu_clock_get_ns().
>
> Liang
>

Yes, we can try to measure the overhead of the qemu_clock_get_ns() calls
and see if things add up perfectly.

Thanks,
- Jitendra
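P.S. To put a number on the qemu_clock_get_ns() overhead itself, a
one-off micro-loop along the lines of the sketch below could be used
(illustration only - measure_clock_overhead() is a made-up helper, not
existing qemu code, and it assumes qemu's usual headers for int64_t,
PRId64 and fprintf):

/* Sketch only: estimate the per-call cost of qemu_clock_get_ns() so it
 * can be subtracted from the instrumented save_zero_page() timings. */
static void measure_clock_overhead(void)
{
    const int64_t iters = 1000000;
    int64_t start, end, i;

    start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    for (i = 0; i < iters; i++) {
        (void)qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    }
    end = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    /* The instrumented save_zero_page() makes up to four
     * qemu_clock_get_ns() calls per page, so the total instrumentation
     * cost is roughly 4 * (per-call cost) * (pages scanned). */
    fprintf(stderr, "qemu_clock_get_ns: ~%" PRId64 " ns per call\n",
            (end - start) / iters);
}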