Subject: Re: [Qemu-devel] [PATCH v1] migration: skip sending ram pages released by virtio-balloon driver.
From: Jitendra Kolhe
To: "Li, Liang Z", Roman Kagan, qemu-devel@nongnu.org, dgilbert@redhat.com, simhan@hpe.com, mohan_parthasarathy@hpe.com
Date: Fri, 11 Mar 2016 20:09:09 +0530
Message-ID: <56E2D88D.2060702@hpe.com>
References: <1457082167-12254-1-git-send-email-jitendra.kolhe@hpe.com> <20160310094912.GC9715@rkaganb.sw.ru> <56E25EBF.6050109@hpe.com> <56E29BD9.8010306@hpe.com>

On 3/11/2016 4:24 PM, Li, Liang Z wrote:
>>>>> I wonder if it is the scanning for zeros or sending the whiteout
>>>>> which affects the total migration time more. If it is the former
>>>>> (as I would expect) then a rather local change to is_zero_range()
>>>>> to make use of the mapping information before scanning would get
>>>>> you all the speedups without protocol changes, interfering with
>>>>> postcopy etc.
>>>>>
>>>>> Roman.
>>>>>
>>>>
>>>> Localizing the solution to the zero page scan check is a good idea.
>>>> I too agree that most of the time is spent in scanning for zero
>>>> pages, in which case we should be able to localize the solution to
>>>> is_zero_range(). However, in the case of ballooned-out pages (which
>>>> can be seen as a subset of guest zero pages) we also spend a very
>>>> small portion of the total migration time in sending the control
>>>> information, which can also be avoided.
>>>> From my tests for a 16GB idle guest of which 12GB was ballooned
>>>> out, the zero page scan time for the 12GB of ballooned-out pages
>>>> was ~1789 ms, and save_page_header + qemu_put_byte(f, 0); for the
>>>> same 12GB of ballooned-out pages was ~556 ms. Total migration time
>>>> was ~8000 ms.
>>>
>>> How did you do the tests? ~556 ms seems too long for putting several
>>> bytes into the buffer.
>>> It's likely the time you measured contains the portion spent
>>> processing the other 4GB of guest memory pages.
>>>
>>> Liang
>>>
>>
>> I modified save_zero_page() as below and updated the timers only for
>> ballooned-out pages, so is_zero_range() should return true (and
>> qemu_balloon_bitmap_test() from my patch set returned 1). With the
>> instrumentation below, I got t1 = ~1789 ms and t2 = ~556 ms. Also, the
>> total migration time noted (~8000 ms) is for the unmodified qemu
>> source.
>
> You mean the total live migration time for the unmodified qemu and the
> 'you modified for test' qemu are almost the same?
>

Not sure I understand the question, but if 'you modified for test' means
the modifications to save_zero_page() shown below, then the answer is no.
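(As an aside, the "localized" change Roman suggested - short-circuiting
the zero page scan with the balloon bitmap - would look roughly like the
sketch below. This is only an illustration, not part of the posted patch
set; it assumes the qemu_balloon_bitmap_test(block, offset) helper from
the proposed patch set and that ballooned-out pages read back as zero.)

/* Sketch only: bypass the byte-wise zero scan for pages the balloon
 * bitmap already marks as released by the guest. */
static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                          uint8_t *p, uint64_t *bytes_transferred)
{
    int pages = -1;

    /* Ballooned-out pages are expected to read as zero, so the
     * is_zero_range() scan can be skipped for them. */
    if (qemu_balloon_bitmap_test(block, offset) == 1 ||
        is_zero_range(p, TARGET_PAGE_SIZE)) {
        acct_info.dup_pages++;
        *bytes_transferred += save_page_header(f, block,
                                               offset | RAM_SAVE_FLAG_COMPRESS);
        qemu_put_byte(f, 0);
        *bytes_transferred += 1;
        pages = 1;
    }
    return pages;
}

The gain from such a localized variant would be limited to the zero scan
itself; the per-page control information (save_page_header +
qemu_put_byte) would still be sent for ballooned-out pages.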
Here is what I tried. Let's say we have 3 versions of qemu (the timings
below are for a 16GB idle guest with 12GB ballooned out):

v1. Unmodified qemu - absolutely no code change - total migration time
= ~7600 ms (I rounded this one to ~8000 ms).

v2. Modified qemu 1 - with the proposed patch set (which skips both the
zero page scan and migrating the control information for ballooned-out
pages) - total migration time = ~5700 ms.

v3. Modified qemu 2 - only with the changes to save_zero_page() as
discussed in the previous mail (and of course using the proposed patch
set only to maintain the bitmap for ballooned-out pages) - total
migration time is irrelevant in this case.
Total zero page scan time = ~1789 ms.
Total (save_page_header + qemu_put_byte(f, 0)) = ~556 ms.

Everything seems to add up here (may not be exact) - 5700+1789+559 =
~8000 ms. I see 2 factors that we have not considered in this addition:
a. the overhead of migrating the balloon bitmap to the target, and
b. as you mentioned below, the overhead of qemu_clock_get_ns().

>> It seems to add up to the final migration time with the proposed
>> patch set.
>>
>> Here is the last entry for "another round" of the test; this time it's
>> ~547 ms:
>> JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 us,
>> save_page_header_time=184 us, total_save_zero_page_time=1453 us
>> cumulated vals: zero_page_scan_time=1723920378 us,
>> save_page_header_time=547514618 us,
>> total_save_zero_page_time=2371059239 us
>>
>> static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>>                           uint8_t *p, uint64_t *bytes_transferred)
>> {
>>     int pages = -1;
>>     int64_t time1, time2, time3, time4;
>>     static int64_t t1 = 0, t2 = 0, t3 = 0;
>>
>>     time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>     if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>         time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>         acct_info.dup_pages++;
>>         *bytes_transferred += save_page_header(f, block,
>>                                                offset | RAM_SAVE_FLAG_COMPRESS);
>>         qemu_put_byte(f, 0);
>>         time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>         *bytes_transferred += 1;
>>         pages = 1;
>>     }
>>     time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>
>>     if (qemu_balloon_bitmap_test(block, offset) == 1) {
>>         t1 += (time2 - time1);
>>         t2 += (time3 - time2);
>>         t3 += (time4 - time1);
>>         fprintf(stderr, "block=%lx, offset=%lx, zero_page_scan_time=%ld us, "
>>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n"
>>                 "cumulated vals: zero_page_scan_time=%ld us, "
>>                 "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n",
>>                 (unsigned long)block, (unsigned long)offset,
>>                 (time2 - time1), (time3 - time2), (time4 - time1), t1, t2, t3);
>>     }
>>     return pages;
>> }
>>
>
> Thanks for your description.
> The issue here is that there are too many qemu_clock_get_ns() calls; the
> cost of the function itself may become the main time-consuming
> operation. You can measure the time consumed by the qemu_clock_get_ns()
> calls you added for the test by comparing the result with a version
> which does not add the qemu_clock_get_ns().
>
> Liang
>

Yes, we can try to measure the overhead of the qemu_clock_get_ns() calls
and see if things add up perfectly.

Thanks,
- Jitendra
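P.S. To put a number on the qemu_clock_get_ns() overhead itself, a
one-off micro-loop along the lines of the sketch below could be used
(illustration only - measure_clock_overhead() is a made-up helper, not
existing qemu code, and it assumes qemu's usual headers for int64_t,
PRId64 and fprintf):

/* Sketch only: estimate the per-call cost of qemu_clock_get_ns() so it
 * can be subtracted from the instrumented save_zero_page() timings. */
static void measure_clock_overhead(void)
{
    const int64_t iters = 1000000;
    int64_t start, end, i;

    start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    for (i = 0; i < iters; i++) {
        (void)qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    }
    end = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    /* The instrumented save_zero_page() makes up to four
     * qemu_clock_get_ns() calls per page, so the total instrumentation
     * cost is roughly 4 * (per-call cost) * (pages scanned). */
    fprintf(stderr, "qemu_clock_get_ns: ~%" PRId64 " ns per call\n",
            (end - start) / iters);
}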