On 10/24/2012 6:49 AM, Chegu Vinod wrote:
> On 10/24/2012 6:40 AM, Vinod, Chegu wrote:
>>
>> Hi
>>
>> This series applies on top of the refactoring that I sent yesterday.
>>
>> Changes from the last version include:
>>
>> - buffered_file.c is gone; its functionality is merged into migration.c.
>>   Please pay special attention to the merge of buffered_file_thread() &
>>   migration_file_put_notify().
>>
>> - Some more bitmap handling optimizations (thanks to Orit & Paolo for
>>   suggestions and code, and Vinod for testing)
>>
>> Please review.  Included is the pointer to the full tree.
>>
>> Thanks, Juan.
>>
>> The following changes since commit b6348f29d033d5a8a26f633d2ee94362595f32a4:
>>
>>   target-arm/translate: Fix RRX operands (2012-10-17 19:56:46 +0200)
>>
>> are available in the git repository at:
>>
>> http://repo.or.cz/r/qemu/quintela.git migration-thread-20121017
>>
>> for you to fetch changes up to 486dabc29f56d8f0e692395d4a6cd483b3a77f01:
>>
>>   ram: optimize migration bitmap walking (2012-10-18 09:20:34 +0200)
>>
>> v3:
>>
>> This is work in progress on top of the previous migration series just sent.
>>
>> - Introduces a thread for migration instead of using a timer and callback
>> - remove the writing to the fd from under the iothread lock
>> - make the writes synchronous
>> - Introduce a new pending method that returns how many bytes are pending
>>   for one save live section
>> - last patch just shows printfs to see where the time is being spent
>>   on the migration complete phase.
>>   (yes it pollutes all uses of stop on the monitor)
>>
>> So far I have found that we spend a lot of time in bdrv_flush_all().  It
>> can take from 1ms to 600ms (yes, it is not a typo).  That dwarfs the
>> default migration downtime (30ms).
>>
>> Stop all vcpus:
>>
>> - it works now (after the changes on qemu_cpu_is_vcpu in the previous
>>   series).  The caveat is that the time that bdrv_flush_all() takes is
>>   "unpredictable".  Any silver bullets?
>>
>>   Paolo suggested calling, for the migration completion phase:
>>
>>   bdrv_aio_flush_all();
>>   Send the dirty pages;
>>   bdrv_drain_all()
>>   bdrv_flush_all()
>>   another round through the bitmap in case the completions have
>>   changed some page
>>
>>   Paolo, did I get it right?
>>   Any other suggestion?
>>
>> - migrate_cancel() is not properly implemented (as in, we
>>   take no locks, ...)
>>
>> - expected_downtime is not calculated.
>>   I am about to merge migrate_fd_put_ready & buffered_thread() and
>>   that would make it trivial to calculate.
>>
>> It outputs something like:
>>
>> wakeup_request 0
>> time cpu_disable_ticks 0
>> time pause_all_vcpus 1
>> time runstate_set 1
>> time vmstate_notify 2
>> time bdrv_drain_all 2
>> time flush device /dev/disk/by-path/ip-192.168.10.200:3260-iscsi-iqn.2010-12.org.trasno:iscsi.lvm-lun-1: 3
>> time flush device : 3
>> time flush device : 3
>> time flush device : 3
>> time bdrv_flush_all 5
>> time monitor_protocol_event 5
>> vm_stop 2 5
>> synchronize_all_states 1
>> migrate RAM 37
>> migrate rest devices 1
>> complete without error 3a 44
>> completed 45
>> end completed stage 45
>>
>> As you can see, we estimate that we can send all pending data in 30ms;
>> it took 37ms to send the RAM (that is what we calculate).  So the
>> estimation is quite good.
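
The 30ms estimate versus the 37ms measured can be read as a bandwidth
calculation: pending bytes divided by the measured transfer rate,
compared against the allowed downtime.  A minimal sketch of that
arithmetic follows.  The function names and the bytes-per-millisecond
unit are assumptions made for illustration; this is not the code from
the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: transfer rate observed over the last iteration,
 * in bytes per millisecond.  Returns 0.0 if no time has elapsed. */
static double bandwidth_bytes_per_ms(uint64_t bytes_sent, uint64_t elapsed_ms)
{
    return elapsed_ms ? (double)bytes_sent / (double)elapsed_ms : 0.0;
}

/* Hypothetical helper: time in ms needed to send pending_bytes at the
 * given rate, or -1.0 if the rate is unknown. */
static double expected_downtime_ms(uint64_t pending_bytes, double bw)
{
    return bw > 0.0 ? (double)pending_bytes / bw : -1.0;
}

/* Hypothetical helper: true when the pending data fits in the allowed
 * downtime, i.e. when it is reasonable to stop the guest and finish. */
static bool can_complete(uint64_t pending_bytes, double bw,
                         uint64_t max_downtime_ms)
{
    return bw > 0.0 && (double)pending_bytes <= bw * (double)max_downtime_ms;
}
```

With a rate of 100 bytes/ms and a 30ms budget, can_complete() accepts up
to 3000 pending bytes; 3700 pending bytes would give an expected downtime
of 37ms.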
>>
>> What gives me lots of variation is the line with the device name in
>> "time flush device".  That is what varies from 1ms to 600ms.
>>
>> This is in a completely idle guest.  I am running:
>>
>>         while (1) {
>>                 uint64_t delay;
>>
>>                 /* Timestamp before and after a requested 10ms sleep. */
>>                 if (gettimeofday(&t0, NULL) != 0)
>>                         perror("gettimeofday 1");
>>                 if (usleep(ms2us(10)) != 0)
>>                         perror("usleep");
>>                 if (gettimeofday(&t1, NULL) != 0)
>>                         perror("gettimeofday 2");
>>
>>                 /* t1 -= t0, borrowing a second if tv_usec underflows. */
>>                 t1.tv_usec -= t0.tv_usec;
>>                 if (t1.tv_usec < 0) {
>>                         t1.tv_usec += 1000000;
>>                         t1.tv_sec--;
>>                 }
>>                 t1.tv_sec -= t0.tv_sec;
>>
>>                 /* Report only sleeps that took more than 100ms. */
>>                 delay = t1.tv_sec * 1000 + t1.tv_usec/1000;
>>                 if (delay > 100)
>>                         printf("delay of %llu ms\n",
>>                                (unsigned long long)delay);
>>         }
>>
>> To see the latency inside the guest (i.e. ask for a 10ms sleep, and see
>> how long it takes).
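
gettimeofday() reads the wall clock, which can jump if the time is
adjusted while the test runs.  A variant of the same measurement using
CLOCK_MONOTONIC is sketched below.  It is not the program that was
actually run; the helper names are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: b - a in whole milliseconds (truncated). */
static int64_t timespec_diff_ms(struct timespec a, struct timespec b)
{
    int64_t ns = ((int64_t)b.tv_sec - (int64_t)a.tv_sec) * 1000000000LL
               + ((int64_t)b.tv_nsec - (int64_t)a.tv_nsec);
    return ns / 1000000;
}

/* Hypothetical helper: sleep for request_ms and return how long the
 * sleep really took in ms, or -1 if a call failed or was interrupted. */
static int64_t measure_sleep_ms(unsigned request_ms)
{
    struct timespec req = {
        .tv_sec = request_ms / 1000,
        .tv_nsec = (long)(request_ms % 1000) * 1000000L,
    };
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    if (nanosleep(&req, NULL) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;
    return timespec_diff_ms(t0, t1);
}
```

Calling measure_sleep_ms(10) in a loop and printing any result above 100
reproduces the check done by the loop quoted above.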
>>
>> [root@d1 ~]# ./timer
>> delay of 161 ms
>> delay of 135 ms
>> delay of 143 ms
>> delay of 132 ms
>> delay of 131 ms
>> delay of 141 ms
>> delay of 113 ms
>> delay of 119 ms
>> delay of 114 ms
>>
>> But those values are independent of migration.  Even without starting
>> the migration, with an idle guest doing nothing, we get them sometimes.
>>
>> Juan Quintela (27):
>>   buffered_file: Move from using a timer to use a thread
>>   migration: make qemu_fopen_ops_buffered() return void
>>   migration: stop all cpus correctly
>>   migration: make writes blocking
>>   migration: remove unfreeze logic
>>   migration: take finer locking
>>   buffered_file: Unfold the trick to restart generating migration data
>>   buffered_file: don't flush on put buffer
>>   buffered_file: unfold buffered_append in buffered_put_buffer
>>   savevm: New save live migration method: pending
>>   migration: include qemu-file.h
>>   migration-fd: remove duplicate include
>>   migration: move buffered_file.c code into migration.c
>>   migration: move migration_fd_put_ready()
>>   migration: Inline qemu_fopen_ops_buffered into migrate_fd_connect
>>   migration: move migration notifier
>>   migration: move begining stage to the migration thread
>>   migration: move exit condition to migration thread
>>   migration: unfold rest of migrate_fd_put_ready() into thread
>>   migration: print times for end phase
>>   ram: rename last_block to last_seen_block
>>   ram: Add last_sent_block
>>   memory: introduce memory_region_test_and_clear_dirty
>>   ram: Use memory_region_test_and_clear_dirty
>>   fix memory.c
>>   migration: Only go to the iterate stage if there is anything to send
>>   ram: optimize migration bitmap walking
>>
>> Paolo Bonzini (1):
>>   split MRU ram list
>>
>> Umesh Deshpande (2):
>>   add a version number to ram_list
>>   protect the ramlist with a separate mutex
>>
>> Makefile.objs     |   2 +-
>> arch_init.c       | 133 +++++++++++--------
>> block-migration.c |  49 ++-----
>> block.c           |   6 +
>> buffered_file.c   | 256 -----------------------------------
>> buffered_file.h   |  22 ---
>> cpu-all.h         |  13 +-
>> cpus.c            |  17 +++
>> exec.c            |  44 +++++-
>> memory.c          |  17 +++
>> memory.h          |  18 +++
>> migration-exec.c  |   4 +-
>> migration-fd.c    |   9 +-
>> migration-tcp.c   |  21 +--
>> migration-unix.c  |   4 +-
>> migration.c       | 391 ++++++++++++++++++++++++++++++++++++++++--------------
>> migration.h       |   4 +-
>> qemu-file.h       |   5 -
>> savevm.c          |  37 +++++-
>> sysemu.h          |   1 +
>> vmstate.h         |   1 +
>> 21 files changed, 522 insertions(+), 532 deletions(-)
>> delete mode 100644 buffered_file.c
>> delete mode 100644 buffered_file.h
>>
>> -- 
>>
>> 1.7.11.7
>>
>
>
> Tested-by: Chegu Vinod <chegu_vinod@hp.com>
>
>
> Using these patches, I have verified live migration (on x86_64
> platforms) for guest sizes varying from 64G/10vcpus through 768G/80vcpus,
> and I have seen a reduction in both the downtime and the total
> migration time.  The dirty bitmap optimizations have shown
> improvements too and have helped reduce the downtime
> (perhaps more can be done as a next step, i.e. after the above changes,
> minus the printfs, make it upstream).  The new migration stats
> that were added were useful too!
>
> Thanks
> Vinod
>

Wanted to follow up on an issue that I had observed.  (I already shared
this with Juan/Orit/Paolo but forgot to mention it in the email above.)

As mentioned above, for larger (>= 256G) guests the cost of the dirty
bitmap sync is high.  At the very start of the migration, i.e. in
ram_save_setup(), I noticed that a lot of time was being spent
syncing the dirty bitmaps (and perhaps also marking the pages
as dirty).  This leads to a multi-second freeze of the guest.  This issue
needs to be addressed as part of optimizing the dirty bitmap sync.
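
For a sense of scale: assuming 4 KiB pages, a 256G guest has 2^26
(about 67 million) pages, so a one-bit-per-page dirty bitmap is 8 MiB.
Walking it one bit at a time is slow; skipping clean words is the usual
first improvement.  A minimal sketch of such a walk follows.  The
function names and layout are hypothetical and are not the QEMU code
from this series.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: index of the first set bit at or after start,
 * or nbits if there is none.  Clean words are skipped in one step. */
static size_t next_dirty_page(const unsigned long *bitmap, size_t nbits,
                              size_t start)
{
    size_t i = start;

    while (i < nbits) {
        size_t word = i / BITS_PER_WORD;
        unsigned long w = bitmap[word] >> (i % BITS_PER_WORD);

        if (w == 0) {
            /* Nothing dirty in the rest of this word: jump to the next. */
            i = (word + 1) * BITS_PER_WORD;
            continue;
        }
        /* Advance to the lowest set bit of the remaining word. */
        while ((w & 1UL) == 0) {
            w >>= 1;
            i++;
        }
        return i < nbits ? i : nbits;
    }
    return nbits;
}

/* Hypothetical helper: number of dirty pages, using the walk above. */
static size_t count_dirty_pages(const unsigned long *bitmap, size_t nbits)
{
    size_t count = 0;
    size_t page = next_dirty_page(bitmap, nbits, 0);

    while (page < nbits) {
        count++;
        page = next_dirty_page(bitmap, nbits, page + 1);
    }
    return count;
}
```

On a mostly clean bitmap the outer loop advances a full word per
iteration, so the cost tracks the number of words plus the number of
dirty pages rather than the number of pages.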

Thanks
Vinod