On 10/24/2012 6:49 AM, Chegu Vinod wrote:
> On 10/24/2012 6:40 AM, Vinod, Chegu wrote:
>>
>> Hi
>>
>> This series applies on top of the refactoring that I sent yesterday.
>> Changes from the last version include:
>>
>> - buffered_file.c is gone, its functionality is merged in migration.c
>>   special attention to the merge of buffered_file_thread() &
>>   migration_file_put_notify().
>>
>> - Some more bitmap handling optimizations (thanks to Orit & Paolo for
>>   suggestions and code and Vinod for testing)
>>
>> Please review.  Included is the pointer to the full tree.
>>
>> Thanks, Juan.
>>
>> The following changes since commit b6348f29d033d5a8a26f633d2ee94362595f32a4:
>>
>>   target-arm/translate: Fix RRX operands (2012-10-17 19:56:46 +0200)
>>
>> are available in the git repository at:
>>
>>   http://repo.or.cz/r/qemu/quintela.git migration-thread-20121017
>>
>> for you to fetch changes up to 486dabc29f56d8f0e692395d4a6cd483b3a77f01:
>>
>>   ram: optimize migration bitmap walking (2012-10-18 09:20:34 +0200)
>>
>> v3:
>>
>> This is work in progress on top of the previous migration series just sent.
>>
>> - Introduces a thread for migration instead of using a timer and callback
>> - remove the writing to the fd from the iothread lock
>> - make the writes synchronous
>> - Introduce a new pending method that returns how many bytes are pending
>>   for one save live section
>> - last patch just shows printfs to see where the time is being spent
>>   on the migration complete phase.
>>   (yes it pollutes all uses of stop on the monitor)
>>
>> So far I have found that we spend a lot of time on bdrv_flush_all().  It
>> can take from 1ms to 600ms (yes, it is not a typo).  That dwarfs the
>> migration default downtime time (30ms).
>>
>> Stop all vcpus:
>>
>> - it works now (after the changes on qemu_cpu_is_vcpu on the previous
>>   series) caveat is that the time that bdrv_flush_all() takes is
>>   "unpredictable".  Any silver bullets?
>>
>>   Paolo suggested to call for migration completion phase:
>>
>>   bdrv_aio_flush_all();
>>   Send the dirty pages;
>>   bdrv_drain_all()
>>   bdrv_flush_all()
>>   another round through the bitmap in case that completions have
>>   changed some page
>>
>>   Paolo, did I get it right?
>>   Any other suggestion?
>>
>> - migrate_cancel() is not properly implemented (as in the film that we
>>   take no locks, ...)
>>
>> - expected_downtime is not calculated.
>>
>>   I am about to merge migrate_fd_put_ready & buffered_thread() and
>>   that would make it trivial to calculate.
>>
>> It outputs something like:
>>
>> wakeup_request 0
>> time cpu_disable_ticks 0
>> time pause_all_vcpus 1
>> time runstate_set 1
>> time vmstate_notify 2
>> time bdrv_drain_all 2
>> time flush device /dev/disk/by-path/ip-192.168.10.200:3260-iscsi-iqn.2010-12.org.trasno:iscsi.lvm-lun-1: 3
>> time flush device : 3
>> time flush device : 3
>> time flush device : 3
>> time bdrv_flush_all 5
>> time monitor_protocol_event 5
>> vm_stop 2 5
>> synchronize_all_states 1
>> migrate RAM 37
>> migrate rest devices 1
>> complete without error 3a 44
>> completed 45
>> end completed stage 45
>>
>> As you can see, we estimate that we can send all pending data in 30ms,
>> it took 37ms to send the RAM (that is what we calculate).  So
>> estimation is quite good.
>>
>> What gives me lots of variation is the line with the device name of
>> "time flush device".
>> That is what varies between 1ms and 600ms
>>
>> This is in a completely idle guest.
>> I am running:
>>
>> while (1) {
>>     uint64_t delay;
>>
>>     if (gettimeofday(&t0, NULL) != 0)
>>         perror("gettimeofday 1");
>>     if (usleep(ms2us(10)) != 0)
>>         perror("usleep");
>>     if (gettimeofday(&t1, NULL) != 0)
>>         perror("gettimeofday 2");
>>
>>     t1.tv_usec -= t0.tv_usec;
>>     if (t1.tv_usec < 0) {
>>         t1.tv_usec += 1000000;
>>         t1.tv_sec--;
>>     }
>>     t1.tv_sec -= t0.tv_sec;
>>
>>     delay = t1.tv_sec * 1000 + t1.tv_usec/1000;
>>
>>     if (delay > 100)
>>         printf("delay of %ld ms\n", delay);
>> }
>>
>> To see the latency inside the guest (i.e. ask for a 10ms sleep, and see
>> how long it takes).
>>
>> [root@d1 ~]# ./timer
>> delay of 161 ms
>> delay of 135 ms
>> delay of 143 ms
>> delay of 132 ms
>> delay of 131 ms
>> delay of 141 ms
>> delay of 113 ms
>> delay of 119 ms
>> delay of 114 ms
>>
>> But those values are independent of migration.  Without even starting
>> the migration, idle guest doing nothing, we get it sometimes.
>>
>> Juan Quintela (27):
>>   buffered_file: Move from using a timer to use a thread
>>   migration: make qemu_fopen_ops_buffered() return void
>>   migration: stop all cpus correctly
>>   migration: make writes blocking
>>   migration: remove unfreeze logic
>>   migration: take finer locking
>>   buffered_file: Unfold the trick to restart generating migration data
>>   buffered_file: don't flush on put buffer
>>   buffered_file: unfold buffered_append in buffered_put_buffer
>>   savevm: New save live migration method: pending
>>   migration: include qemu-file.h
>>   migration-fd: remove duplicate include
>>   migration: move buffered_file.c code into migration.c
>>   migration: move migration_fd_put_ready()
>>   migration: Inline qemu_fopen_ops_buffered into migrate_fd_connect
>>   migration: move migration notifier
>>   migration: move begining stage to the migration thread
>>   migration: move exit condition to migration thread
>>   migration: unfold rest of migrate_fd_put_ready() into thread
>>   migration: print times for end phase
>>   ram: rename last_block to last_seen_block
>>   ram: Add last_sent_block
>>   memory: introduce memory_region_test_and_clear_dirty
>>   ram: Use memory_region_test_and_clear_dirty
>>   fix memory.c
>>   migration: Only go to the iterate stage if there is anything to send
>>   ram: optimize migration bitmap walking
>>
>> Paolo Bonzini (1):
>>   split MRU ram list
>>
>> Umesh Deshpande (2):
>>   add a version number to ram_list
>>   protect the ramlist with a separate mutex
>>
>>  Makefile.objs     |   2 +-
>>  arch_init.c       | 133 +++++++++++--------
>>  block-migration.c |  49 ++-----
>>  block.c           |   6 +
>>  buffered_file.c   | 256 -----------------------------------
>>  buffered_file.h   |  22 ---
>>  cpu-all.h         |  13 +-
>>  cpus.c            |  17 +++
>>  exec.c            |  44 +++++-
>>  memory.c          |  17 +++
>>  memory.h          |  18 +++
>>  migration-exec.c  |   4 +-
>>  migration-fd.c    |   9 +-
>>  migration-tcp.c   |  21 +--
>>  migration-unix.c  |   4 +-
>>  migration.c       | 391 ++++++++++++++++++++++++++++++++++++++++--------------
>>  migration.h       |   4 +-
>>  qemu-file.h       |   5 -
>>  savevm.c          |  37 +++++-
>>  sysemu.h          |   1 +
>>  vmstate.h         |   1 +
>>  21 files changed, 522 insertions(+), 532 deletions(-)
>>  delete mode 100644 buffered_file.c
>>  delete mode 100644 buffered_file.h
>>
>> --
>> 1.7.11.7
>
> Tested-by: Chegu Vinod
>
> Using these patches I have verified live migration (on x86_64
> platforms) for guest sizes varying from 64G/10vcpus through 768G/80vcpus
> and I have seen a reduction in both the downtime and the total
> migration time.  The dirty bitmap optimizations have shown
> improvements too and have helped in the reduction of the downtime
> (perhaps more can be done as a next step, i.e. after the above changes
> (minus the printf's) make it into upstream).  The new migration stats
> that were added were useful too!
>
> Thanks
> Vinod
>

Wanted to follow up on an issue that I had observed...

As mentioned above, for larger (>= 256G) guests the cost of the dirty
bitmap sync is high.  At the very start of the migration, i.e. in
ram_save_setup(), I noticed that a lot of time was being spent syncing
the dirty bitmaps (and perhaps also marking the pages as dirty).  This
leads to a multi-second freeze of the guest.

This issue needs to be addressed as part of optimizing the dirty bitmap
sync.

Thanks
Vinod
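
For a rough sense of scale on the bitmap sync cost: assuming 4 KiB pages
and one dirty bit per page (both are assumptions here, not taken from the
patches), a 768G guest has 201,326,592 pages and a 24 MiB bitmap. The
sketch below is not the QEMU code; dirty_bitmap_bytes() and
count_and_clear_dirty() are hypothetical names used only to illustrate
sizing the bitmap and walking it one 64-bit word at a time, so that clean
regions cost a single comparison per 64 pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes needed for a dirty bitmap with one bit per page.
 * Hypothetical helper; the one-bit-per-page layout is an assumption.
 */
static uint64_t dirty_bitmap_bytes(uint64_t ram_bytes, uint64_t page_size)
{
    assert(page_size > 0);
    uint64_t pages = (ram_bytes + page_size - 1) / page_size; /* round up */
    return (pages + 7) / 8;                                   /* bits -> bytes */
}

/*
 * Count the dirty pages recorded in the bitmap and clear them.
 * All-zero words are skipped with one comparison; set bits in a
 * non-zero word are counted by repeatedly clearing the lowest set bit.
 */
static uint64_t count_and_clear_dirty(uint64_t *bitmap, size_t nwords)
{
    uint64_t dirty = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint64_t w = bitmap[i];

        if (w == 0)
            continue;           /* 64 clean pages, nothing to do */
        while (w) {
            w &= w - 1;         /* clear the lowest set bit */
            dirty++;
        }
        bitmap[i] = 0;          /* pages are now considered clean */
    }
    return dirty;
}
```

With these assumptions, dirty_bitmap_bytes(768ULL << 30, 4096) gives
25165824 bytes. If every page is marked dirty at setup time, as may be
the case in ram_save_setup(), the word-skipping shortcut does not help on
the first pass, which would be consistent with the freeze being most
visible at the start of the migration.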