Message-ID: <4FD56C73.8080708@hp.com>
Date: Sun, 10 Jun 2012 20:56:35 -0700
From: Chegu Vinod
Subject: Re: [Qemu-devel] [RFC 0/7] Fix migration with lots of memory
Reply-To: chegu_vinod@hp.com
To: Juan Quintela
Cc: qemu-devel@nongnu.org

Hello,

I picked up these patches a while back and ran some migration tests with
simple workloads running in the guest.  Below are some results.  FYI...

Vinod

----

Config details:

Guest: 10 vcpus, 60GB (running on a host that has 6 cores (12 threads) and
64GB).  The hosts are identical x86_64 blade servers and are connected via a
private 10G link (used for the migration traffic).

The guest was started using qemu directly (no virsh/virt-manager etc.).
Migration was initiated at the qemu monitor prompt, and migrate_set_speed was
used to set the speed to 10G (rough command sketch at the end of this note).
No changes to the downtime.

Software:
- Guest & host OS : 3.4.0-rc7+
- Vanilla         : basic upstream qemu.git
- huge_memory     : the huge_memory changes (Juan's qemu.git tree)

[Note: BTW, I also tried v11 of the XBZRLE patches, but ran into issues
(guest crashed after migration); I have reported that to the author.]

Here are the simple "workloads" and results:

1) Idling guest
2) AIM7-compute (with 2000 users)
3) 10-way parallel make (of the kernel)
4) 2 instances of a memory r/w loop (exactly the same as in docs/xbzrle.txt)
5) SPECjbb2005

Note: In the Vanilla case I had instrumented ram_save_live() to print out the
total migration time and the MBs transferred.

1) Idling guest:
   Vanilla     : total mig. time: 173016 ms   MB transferred: 1606 MB
   huge_memory : total mig. time:  48821 ms   MB transferred: 1620 MB

2) AIM7-compute (2000 users):
   Vanilla     : total mig. time: 241124 ms   MB transferred: 4827 MB
   huge_memory : total mig. time:  66716 ms   MB transferred: 4022 MB

3) 10-way parallel make (of the Linux kernel):
   Vanilla     : total mig. time: 104319 ms   MB transferred: 2316 MB
   huge_memory : total mig. time:  55105 ms   MB transferred: 2995 MB

4) 2 instances of the memory r/w loop (refer to docs/xbzrle.txt):
   Vanilla     : total mig. time: 112102 ms   MB transferred: 1739 MB
   huge_memory : total mig. time:  85504 ms   MB transferred: 1745 MB

5) SPECjbb2005:
   Vanilla     : total mig. time: 162189 ms   MB transferred: 5461 MB
   huge_memory : total mig. time:  67787 ms   MB transferred: 8528 MB

[Expected] Observation: Unlike in the Vanilla case (and also the XBZRLE case),
with these patches I was still able to interact with the qemu monitor prompt
and with the guest during the migration (i.e. during the iterative pre-copy
phase).
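
The monitor commands used were along these lines (the destination URI below is
only an example; the HMP command for the bandwidth limit is migrate_set_speed):

    (qemu) migrate_set_speed 10G
    (qemu) migrate -d tcp:<destination-host>:<port>
    (qemu) info migrate
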
------

On 5/22/2012 11:32 AM, Juan Quintela wrote:
> Hi
>
> After a long, long time, this is v2.
>
> These are basically the changes that we have for RHEL, due to the
> problems that we have with big memory machines.  I just rebased the
> patches and fixed the easy parts:
>
> - buffered_file_limit is gone: we just use 50ms and call it a day.
>
> - I kept ram_addr_t as a valid type for a counter (no, I still don't
>   agree with Anthony on this, but it is not important).
>
> - Print the total time of the migration always.  Notice that I also print
>   it when the migration has completed.  Luiz, could you take a look to see
>   if I did something wrong (probably)?
>
> - Moved the debug printfs to tracepoints.  Thanks a lot to Stefan for
>   helping with it.  Once here, I had to put the traces in the middle of
>   the trace-events file; if I put them at the end of the file, then when I
>   enabled them I got the previous two tracepoints generated instead of the
>   ones I had just defined.  Stefan is looking into that.  The workaround is
>   to define them anywhere else.
>
> - Exit from cpu_physical_memory_reset_dirty().  Anthony wanted me to
>   create an empty stub for kvm and keep the code for tcg.  The problem is
>   that we can have both kvm and tcg running from the same binary.  Instead
>   of exiting in the middle of the function, I just refactored the code
>   out.  Is there a struct where I could add a new function pointer for
>   this behaviour?
>
> - Exit if we have been too long in the ram_save_live() loop.  Anthony
>   didn't like this; I will send a version based on the migration thread in
>   the following days, but I just needed something working for other people
>   to test.
>
>   Notice that I still get "lots" of more-than-50ms printfs.  (Yes, there
>   is a debugging printf there.)
>
> - Bitmap handling.  All the code to count dirty pages is still there; I
>   will try to get something saner based on bitmap optimizations.
>
> Comments?
>
> Later, Juan.
>
>
> v1:
> ---
>
> Executive Summary
> -----------------
>
> This series of patches fixes migration with lots of memory.  With them,
> the stalls are removed and max_downtime is honored.
> I also add infrastructure to measure what is happening during migration
> (#define DEBUG_MIGRATION and DEBUG_SAVEVM).
>
> Migration is broken in the qemu tree at the moment; Michael's patch is
> needed to fix virtio migration.  Measurements are given for the qemu-kvm
> tree.  At the end there are some measurements for the qemu tree.
>
> Long version with measurements (for those that like numbers O:-)
> -----------------------------------------------------------------
>
> 8 vCPUs and 64GB RAM, a RHEL5 guest that is completely idle.
>
> initial
> -------
>
> savevm: save live iterate section id 3 name ram took 3266 milliseconds 46 times
>
> We have 46 stalls, and missed the 100ms deadline 46 times.
> The stalls took around 3.5 to 3.6 seconds each.
>
> savevm: save devices took 1 milliseconds
>
> In case you had any doubt: the rest of the devices (i.e. not RAM) took
> less than 1ms, so we don't care about optimizing them for now.
>
> migration: ended after 207411 milliseconds
>
> The total migration took 207 seconds for this guest.
>
> samples  %        image name          symbol name
> 2161431  72.8297  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>  379416  12.7845  qemu-system-x86_64  ram_save_live
>  367880  12.3958  qemu-system-x86_64  ram_save_block
>   16647   0.5609  qemu-system-x86_64  qemu_put_byte
>   10416   0.3510  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>    9013   0.3037  qemu-system-x86_64  qemu_put_be32
>
> Clearly, we are spending too much time in cpu_physical_memory_reset_dirty.
>
> ping results during the migration:
>
> rtt min/avg/max/mdev = 474.395/39772.087/151843.178/55413.633 ms, pipe 152
>
> You can see that the mean and maximum values are quite big.
>
> We got the dreaded "CPU soft lockup for 10s" in the guests.
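
The profile above is dominated by cpu_physical_memory_reset_dirty(); that is
the cost the "Only TCG needs TLB handling" change (described below) goes
after.  A rough sketch of the idea only, with made-up names rather than the
actual QEMU code: the dirty-bitmap clearing stays for everyone, but the
per-vCPU software-TLB walk is needed only when TCG is in use, since KVM keeps
no QEMU-side TLB that has to learn about dirty-bit resets.

    /* Illustrative sketch, not the actual patch; all names are made up. */
    typedef unsigned long ram_addr_t_like;   /* stand-in for QEMU's ram_addr_t */

    static int using_tcg;                    /* stand-in for a tcg-enabled check */

    static void bitmap_clear_dirty_range(ram_addr_t_like start, ram_addr_t_like end)
    {
        /* clear the relevant bits in the RAM dirty bitmap (omitted) */
        (void)start; (void)end;
    }

    static void tcg_tlb_reset_dirty_range(ram_addr_t_like start, ram_addr_t_like end)
    {
        /* walk every vCPU's software TLB so the next write to these pages
         * faults and re-marks them dirty (omitted); only TCG has such a TLB */
        (void)start; (void)end;
    }

    void reset_dirty(ram_addr_t_like start, ram_addr_t_like end)
    {
        bitmap_clear_dirty_range(start, end);

        if (using_tcg) {
            /* With KVM this walk is pure overhead, so it is skipped. */
            tcg_tlb_reset_dirty_range(start, end);
        }
    }

Skipping that walk under KVM is why the function all but disappears from the
profiles in the "KVM doesn't care about TLB handling" numbers further down.
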
>
> No need to iterate if we already are over the limit
> ---------------------------------------------------
>
> Numbers are similar to the previous ones.
>
> KVM doesn't care about TLB handling
> -----------------------------------
>
> savevm: save live iterate section id 3 name ram took 466 milliseconds 56 times
>
> 56 stalls, but much smaller ones: between 0.5 and 1.4 seconds.
>
> migration: ended after 115949 milliseconds
>
> The total time has improved a lot: 115 seconds.
>
> samples  %        image name          symbol name
>  431530  52.1152  qemu-system-x86_64  ram_save_live
>  355568  42.9414  qemu-system-x86_64  ram_save_block
>   14446   1.7446  qemu-system-x86_64  qemu_put_byte
>   11856   1.4318  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>    3281   0.3962  qemu-system-x86_64  qemu_put_be32
>    2426   0.2930  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2180   0.2633  qemu-system-x86_64  qemu_put_be64
>
> Notice how cpu_physical_memory_reset_dirty() now uses much less time.
>
> rtt min/avg/max/mdev = 474.438/1529.387/15578.055/2595.186 ms, pipe 16
>
> Ping values from outside to the guest have improved a bit, but are still
> bad.
>
> Exit loop if we have been there too long
> ----------------------------------------
>
> Not a single stall bigger than 100ms.
>
> migration: ended after 157511 milliseconds
>
> Not as good a time as the previous one, but we have removed the stalls.
>
> samples  %        image name          symbol name
> 1104546  71.8260  qemu-system-x86_64  ram_save_live
>  370472  24.0909  qemu-system-x86_64  ram_save_block
>   30419   1.9781  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   16252   1.0568  qemu-system-x86_64  qemu_put_byte
>    3400   0.2211  qemu-system-x86_64  qemu_put_be32
>    2657   0.1728  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2206   0.1435  qemu-system-x86_64  qemu_put_be64
>    1559   0.1014  qemu-system-x86_64  qemu_file_rate_limit
>
> You can see that the ping times are improving:
>
> rtt min/avg/max/mdev = 474.422/504.416/628.508/35.366 ms
>
> Now the maximum is near the minimum, i.e. reasonable values.
>
> The limit for the loop in stage 2 has been set to 50ms because
> buffered_file runs a timer every 100ms.  If we miss that timer, we end up
> in trouble, so I used 100/2.
>
> I tried other values: 15ms (max_downtime/2, so it could be set by the
> user), but that gave too long a total time (~400 seconds).
>
> I tried bigger values, 75ms and 100ms, but with either of them we got
> stalls, sometimes as big as 1s, because we lose some timer runs and then
> the calculations are wrong.
>
> With this patch, the soft lockups are gone.
>
> Change calculation to exit live migration
> -----------------------------------------
>
> We spent too much time in ram_save_live(); the problem is the calculation
> of the number of dirty pages (ram_save_remaining()).  Instead of walking
> the bitmap each time we need the value, we now maintain the number of
> dirty pages, updating it each time we change a value in the bitmap.
>
> migration: ended after 151187 milliseconds
>
> Same total time.
>
> samples  %        image name          symbol name
>  365104  84.1659  qemu-system-x86_64  ram_save_block
>   32048   7.3879  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   16033   3.6960  qemu-system-x86_64  qemu_put_byte
>    3383   0.7799  qemu-system-x86_64  qemu_put_be32
>    3028   0.6980  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2174   0.5012  qemu-system-x86_64  qemu_put_be64
>    1953   0.4502  qemu-system-x86_64  ram_save_live
>    1408   0.3246  qemu-system-x86_64  qemu_file_rate_limit
>
> Time is spent in ram_save_block() as expected.
>
> rtt min/avg/max/mdev = 474.412/492.713/539.419/21.896 ms
>
> The standard deviation is still better than without this change.
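
A rough sketch of what that last change amounts to, with made-up names (not
the actual patch): count dirty pages on every 0 -> 1 and 1 -> 0 bit
transition, so that asking "how many dirty pages are left" becomes a counter
read instead of a walk over the whole bitmap.  For a 64GB guest the bitmap
covers roughly 16 million 4KB pages, so repeating that walk is expensive.

    /* Illustrative sketch only; in QEMU the counter would live next to the
     * migration dirty bitmap. */
    #include <stddef.h>
    #include <stdint.h>

    #define MIG_PAGES     (1 << 20)              /* example: 1M tracked pages */
    #define BITS_PER_WORD (8 * sizeof(unsigned long))

    static unsigned long dirty_bitmap[MIG_PAGES / BITS_PER_WORD];
    static uint64_t dirty_pages;                 /* maintained on every bit flip */

    static void mig_set_dirty(size_t page)
    {
        unsigned long mask = 1UL << (page % BITS_PER_WORD);
        unsigned long *word = &dirty_bitmap[page / BITS_PER_WORD];

        if (!(*word & mask)) {                   /* count only 0 -> 1 transitions */
            *word |= mask;
            dirty_pages++;
        }
    }

    static void mig_clear_dirty(size_t page)
    {
        unsigned long mask = 1UL << (page % BITS_PER_WORD);
        unsigned long *word = &dirty_bitmap[page / BITS_PER_WORD];

        if (*word & mask) {                      /* count only 1 -> 0 transitions */
            *word &= ~mask;
            dirty_pages--;
        }
    }

    static uint64_t mig_remaining(void)
    {
        return dirty_pages;                      /* O(1): no bitmap walk needed */
    }
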
>
>
> And now, with load on the guest!!!
> ----------------------------------
>
> I will only show the results without my patches applied and, at the end,
> with all of them applied (with load it takes more time to run the tests).
>
> The load is synthetic:
>
>    stress -c 2 -m 4 --vm-bytes 256M
>
> (2 CPU threads and 4 memory threads, each memory thread dirtying 256MB of RAM)
>
> Notice that we are dirtying too much memory to be able to migrate with the
> default downtime of 30ms.  What the migration should do is keep looping,
> but without stalls.  To get the migration to finish, I just kill the
> stress process after several iterations through all of memory.
>
> initial
> -------
>
> The same stalls as without load (the stalls are caused when it finds lots
> of contiguous zero pages).
>
> samples  %        image name          symbol name
> 2328320  52.9645  qemu-system-x86_64  cpu_physical_memory_reset_dirty
> 1504561  34.2257  qemu-system-x86_64  ram_save_live
>  382838   8.7088  qemu-system-x86_64  ram_save_block
>   52050   1.1840  qemu-system-x86_64  cpu_get_physical_page_desc
>   48975   1.1141  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>
> rtt min/avg/max/mdev = 474.428/21033.451/134818.933/38245.396 ms, pipe 135
>
> You can see that the values/results are similar to what we had before.
>
> with all patches
> ----------------
>
> No stalls; I stopped it after 438 seconds.
>
> samples  %        image name          symbol name
>  387722  56.4676  qemu-system-x86_64  ram_save_block
>  109500  15.9475  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   92328  13.4466  qemu-system-x86_64  cpu_get_physical_page_desc
>   43573   6.3459  qemu-system-x86_64  phys_page_find_alloc
>   18255   2.6586  qemu-system-x86_64  qemu_put_byte
>    3940   0.5738  qemu-system-x86_64  qemu_put_be32
>    3621   0.5274  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2591   0.3774  qemu-system-x86_64  ram_save_live
>
> And ping gives values similar to the unloaded case:
>
> rtt min/avg/max/mdev = 474.400/486.094/548.479/15.820 ms
>
> Note:
>
> - I tested a version of these patches/algorithms with 400GB guests on an
>   old qemu-kvm version (0.9.1, the one in RHEL5).  With that much memory,
>   the handling of the dirty bitmap is the thing that ends up causing
>   stalls; I will try to retest when I get access to the machines again.
>
>
> QEMU tree
> ---------
>
> original qemu
> -------------
>
> savevm: save live iterate section id 2 name ram took 296 milliseconds 47 times
>
> Stalls similar to qemu-kvm.
>
> migration: ended after 205938 milliseconds
>
> Similar total time.
>
> samples  %        image name          symbol name
> 2158149  72.3752  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>  382016  12.8112  qemu-system-x86_64  ram_save_live
>  367000  12.3076  qemu-system-x86_64  ram_save_block
>   18012   0.6040  qemu-system-x86_64  qemu_put_byte
>   10496   0.3520  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>    7366   0.2470  qemu-system-x86_64  qemu_get_ram_ptr
>
> Very bad ping times:
>
> rtt min/avg/max/mdev = 474.424/54575.554/159139.429/54473.043 ms, pipe 160
>
>
> with all patches applied (no load)
> ----------------------------------
>
> savevm: save live iterate section id 2 name ram took 109 milliseconds 1 times
>
> Only one mini-stall, and it is during stage 3 of savevm.
>
> migration: ended after 149529 milliseconds
>
> Similar total time (a bit faster, indeed).
>
> samples  %        image name          symbol name
>  366803  73.9172  qemu-system-x86_64  ram_save_block
>   31717   6.3915  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   16489   3.3228  qemu-system-x86_64  qemu_put_byte
>    5512   1.1108  qemu-system-x86_64  main_loop_wait
>    4886   0.9846  qemu-system-x86_64  cpu_exec_all
>    3418   0.6888  qemu-system-x86_64  qemu_put_be32
>    3397   0.6846  qemu-system-x86_64  kvm_vcpu_ioctl
>    3334   0.6719  [vdso] (tgid:18656 range:0x7ffff7ffe000-0x7ffff7fff000)  [vdso] (tgid:18656 range:0x7ffff7ffe000-0x7ffff7fff000)
>    2913   0.5870  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>
> The standard deviation is a bit worse than qemu-kvm, but nothing to write
> home about:
>
> rtt min/avg/max/mdev = 475.406/485.577/909.463/40.292 ms
>
> Juan Quintela (7):
>   Add spent time for migration
>   Add tracepoints for savevm section start/end
>   No need to iterate if we already are over the limit
>   Only TCG needs TLB handling
>   Only calculate expected_time for stage 2
>   Exit loop if we have been there too long
>   Maintaing number of dirty pages
>
>  arch_init.c      |   40 ++++++++++++++++++++++------------------
>  cpu-all.h        |    1 +
>  exec-obsolete.h  |    8 ++++++++
>  exec.c           |   33 +++++++++++++++++++++++----------
>  hmp.c            |    2 ++
>  migration.c      |   11 +++++++++++
>  migration.h      |    1 +
>  qapi-schema.json |   12 +++++++++---
>  savevm.c         |   11 +++++++++++
>  trace-events     |    6 ++++++
>  10 files changed, 94 insertions(+), 31 deletions(-)
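
One more sketch, for the "Exit loop if we have been there too long" patch
listed above (the change the measurements credit with removing the stalls).
The names and the clock helper here are illustrative only; the real code uses
QEMU's own clock and rate-limit plumbing.  The point is simply to bound one
pass of the stage-2 RAM loop to about 50ms, half of the 100ms buffered_file
timer period, so the main loop keeps running:

    /* Illustrative sketch only, not the actual patch. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    enum { MAX_STAGE2_TIME_MS = 50 };  /* half of the 100ms buffered_file timer */

    static int64_t clock_ms(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    /* True when the caller should leave the page-sending loop and let the
     * main loop (timers, monitor, vCPUs) run again. */
    static bool stage2_time_exceeded(int64_t loop_start_ms)
    {
        return clock_ms() - loop_start_ms > MAX_STAGE2_TIME_MS;
    }

    /* Intended use inside the stage-2 loop (caller names are hypothetical):
     *
     *     int64_t start = clock_ms();
     *     while (more_dirty_pages() && !rate_limit_hit()) {
     *         send_one_page();
     *         if (stage2_time_exceeded(start)) {
     *             break;   // come back on the next timer tick
     *         }
     *     }
     */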