Message-ID: <4FD56C73.8080708@hp.com>
Date: Sun, 10 Jun 2012 20:56:35 -0700
From: Chegu Vinod
Subject: Re: [Qemu-devel] [RFC 0/7] Fix migration with lots of memory
Reply-To: chegu_vinod@hp.com
To: Juan Quintela
Cc: qemu-devel@nongnu.org

Hello,

I picked up these patches a while back and ran some migration tests with
simple workloads running in the guest.  Below are some results.  FYI...

Vinod

----

Config details:

Guest: 10 vcpus, 60GB (running on a host that has 6 cores (12 threads) and
64GB).  The hosts are identical x86_64 blade servers and are connected via a
private 10G link (used for the migration traffic).

The guest was started using qemu directly (no virsh/virt-manager etc.).
Migration was initiated at the qemu monitor prompt, and migrate_set_speed was
used to set the speed to 10G (rough command sketch at the end of this note).
No changes to the downtime.

Software:
- Guest & host OS : 3.4.0-rc7+
- Vanilla         : basic upstream qemu.git
- huge_memory     : the huge_memory changes (Juan's qemu.git tree)

[Note: BTW, I also tried v11 of the XBZRLE patches, but ran into issues
(guest crashed after migration); I have reported that to the author.]

Here are the simple "workloads" and results:

1) Idling guest
2) AIM7-compute (with 2000 users)
3) 10-way parallel make (of the kernel)
4) 2 instances of a memory r/w loop (exactly the same as in docs/xbzrle.txt)
5) SPECjbb2005

Note: In the Vanilla case I had instrumented ram_save_live() to print out the
total migration time and the MBs transferred.

1) Idling guest:
   Vanilla     : total mig. time: 173016 ms   MB transferred: 1606 MB
   huge_memory : total mig. time:  48821 ms   MB transferred: 1620 MB

2) AIM7-compute (2000 users):
   Vanilla     : total mig. time: 241124 ms   MB transferred: 4827 MB
   huge_memory : total mig. time:  66716 ms   MB transferred: 4022 MB

3) 10-way parallel make (of the Linux kernel):
   Vanilla     : total mig. time: 104319 ms   MB transferred: 2316 MB
   huge_memory : total mig. time:  55105 ms   MB transferred: 2995 MB

4) 2 instances of the memory r/w loop (refer to docs/xbzrle.txt):
   Vanilla     : total mig. time: 112102 ms   MB transferred: 1739 MB
   huge_memory : total mig. time:  85504 ms   MB transferred: 1745 MB

5) SPECjbb2005:
   Vanilla     : total mig. time: 162189 ms   MB transferred: 5461 MB
   huge_memory : total mig. time:  67787 ms   MB transferred: 8528 MB

[Expected] Observation: Unlike in the Vanilla case (and also the XBZRLE case),
with these patches I was still able to interact with the qemu monitor prompt
and with the guest during the migration (i.e. during the iterative pre-copy
phase).
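
The monitor commands used were along these lines (the destination URI below is
only an example; the HMP command for the bandwidth limit is migrate_set_speed):

    (qemu) migrate_set_speed 10G
    (qemu) migrate -d tcp:<destination-host>:<port>
    (qemu) info migrate
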
------

On 5/22/2012 11:32 AM, Juan Quintela wrote:
> Hi
>
> After a long, long time, this is v2.
>
> These are basically the changes that we have for RHEL, due to the
> problems that we have with big memory machines.  I just rebased the
> patches and fixed the easy parts:
>
> - buffered_file_limit is gone: we just use 50ms and call it a day.
>
> - I kept ram_addr_t as a valid type for a counter (no, I still don't
>   agree with Anthony on this, but it is not important).
>
> - Print the total time of the migration always.  Notice that I also print
>   it when the migration has completed.  Luiz, could you take a look to see
>   if I did something wrong (probably)?
>
> - Moved the debug printfs to tracepoints.  Thanks a lot to Stefan for
>   helping with it.  Once here, I had to put the traces in the middle of
>   the trace-events file; if I put them at the end of the file, then when I
>   enabled them I got the previous two tracepoints generated instead of the
>   ones I had just defined.  Stefan is looking into that.  The workaround is
>   to define them anywhere else.
>
> - Exit from cpu_physical_memory_reset_dirty().  Anthony wanted me to
>   create an empty stub for kvm and keep the code for tcg.  The problem is
>   that we can have both kvm and tcg running from the same binary.  Instead
>   of exiting in the middle of the function, I just refactored the code
>   out.  Is there a struct where I could add a new function pointer for
>   this behaviour?
>
> - Exit if we have been too long in the ram_save_live() loop.  Anthony
>   didn't like this; I will send a version based on the migration thread in
>   the following days, but I just needed something working for other people
>   to test.
>
>   Notice that I still get "lots" of more-than-50ms printfs.  (Yes, there
>   is a debugging printf there.)
>
> - Bitmap handling.  All the code to count dirty pages is still there; I
>   will try to get something saner based on bitmap optimizations.
>
> Comments?
>
> Later, Juan.
>
>
> v1:
> ---
>
> Executive Summary
> -----------------
>
> This series of patches fixes migration with lots of memory.  With them,
> the stalls are removed and max_downtime is honored.
> I also add infrastructure to measure what is happening during migration
> (#define DEBUG_MIGRATION and DEBUG_SAVEVM).
>
> Migration is broken in the qemu tree at the moment; Michael's patch is
> needed to fix virtio migration.  Measurements are given for the qemu-kvm
> tree.  At the end there are some measurements for the qemu tree.
>
> Long version with measurements (for those that like numbers O:-)
> -----------------------------------------------------------------
>
> 8 vCPUs and 64GB RAM, a RHEL5 guest that is completely idle.
>
> initial
> -------
>
> savevm: save live iterate section id 3 name ram took 3266 milliseconds 46 times
>
> We have 46 stalls, and missed the 100ms deadline 46 times.
> The stalls took around 3.5 to 3.6 seconds each.
>
> savevm: save devices took 1 milliseconds
>
> In case you had any doubt: the rest of the devices (i.e. not RAM) took
> less than 1ms, so we don't care about optimizing them for now.
>
> migration: ended after 207411 milliseconds
>
> The total migration took 207 seconds for this guest.
>
> samples  %        image name          symbol name
> 2161431  72.8297  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>  379416  12.7845  qemu-system-x86_64  ram_save_live
>  367880  12.3958  qemu-system-x86_64  ram_save_block
>   16647   0.5609  qemu-system-x86_64  qemu_put_byte
>   10416   0.3510  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>    9013   0.3037  qemu-system-x86_64  qemu_put_be32
>
> Clearly, we are spending too much time in cpu_physical_memory_reset_dirty.
>
> ping results during the migration:
>
> rtt min/avg/max/mdev = 474.395/39772.087/151843.178/55413.633 ms, pipe 152
>
> You can see that the mean and maximum values are quite big.
>
> We got the dreaded "CPU soft lockup for 10s" in the guests.
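
The profile above is dominated by cpu_physical_memory_reset_dirty(); that is
the cost the "Only TCG needs TLB handling" change (described below) goes
after.  A rough sketch of the idea only, with made-up names rather than the
actual QEMU code: the dirty-bitmap clearing stays for everyone, but the
per-vCPU software-TLB walk is needed only when TCG is in use, since KVM keeps
no QEMU-side TLB that has to learn about dirty-bit resets.

    /* Illustrative sketch, not the actual patch; all names are made up. */
    typedef unsigned long ram_addr_t_like;   /* stand-in for QEMU's ram_addr_t */

    static int using_tcg;                    /* stand-in for a tcg-enabled check */

    static void bitmap_clear_dirty_range(ram_addr_t_like start, ram_addr_t_like end)
    {
        /* clear the relevant bits in the RAM dirty bitmap (omitted) */
        (void)start; (void)end;
    }

    static void tcg_tlb_reset_dirty_range(ram_addr_t_like start, ram_addr_t_like end)
    {
        /* walk every vCPU's software TLB so the next write to these pages
         * faults and re-marks them dirty (omitted); only TCG has such a TLB */
        (void)start; (void)end;
    }

    void reset_dirty(ram_addr_t_like start, ram_addr_t_like end)
    {
        bitmap_clear_dirty_range(start, end);

        if (using_tcg) {
            /* With KVM this walk is pure overhead, so it is skipped. */
            tcg_tlb_reset_dirty_range(start, end);
        }
    }

Skipping that walk under KVM is why the function all but disappears from the
profiles in the "KVM doesn't care about TLB handling" numbers further down.
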
>
> No need to iterate if we already are over the limit
> ---------------------------------------------------
>
> Numbers are similar to the previous ones.
>
> KVM doesn't care about TLB handling
> -----------------------------------
>
> savevm: save live iterate section id 3 name ram took 466 milliseconds 56 times
>
> 56 stalls, but much smaller ones: between 0.5 and 1.4 seconds.
>
> migration: ended after 115949 milliseconds
>
> The total time has improved a lot: 115 seconds.
>
> samples  %        image name          symbol name
>  431530  52.1152  qemu-system-x86_64  ram_save_live
>  355568  42.9414  qemu-system-x86_64  ram_save_block
>   14446   1.7446  qemu-system-x86_64  qemu_put_byte
>   11856   1.4318  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>    3281   0.3962  qemu-system-x86_64  qemu_put_be32
>    2426   0.2930  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2180   0.2633  qemu-system-x86_64  qemu_put_be64
>
> Notice how cpu_physical_memory_reset_dirty() now uses much less time.
>
> rtt min/avg/max/mdev = 474.438/1529.387/15578.055/2595.186 ms, pipe 16
>
> Ping values from outside to the guest have improved a bit, but are still
> bad.
>
> Exit loop if we have been there too long
> ----------------------------------------
>
> Not a single stall bigger than 100ms.
>
> migration: ended after 157511 milliseconds
>
> Not as good a time as the previous one, but we have removed the stalls.
>
> samples  %        image name          symbol name
> 1104546  71.8260  qemu-system-x86_64  ram_save_live
>  370472  24.0909  qemu-system-x86_64  ram_save_block
>   30419   1.9781  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   16252   1.0568  qemu-system-x86_64  qemu_put_byte
>    3400   0.2211  qemu-system-x86_64  qemu_put_be32
>    2657   0.1728  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2206   0.1435  qemu-system-x86_64  qemu_put_be64
>    1559   0.1014  qemu-system-x86_64  qemu_file_rate_limit
>
> You can see that the ping times are improving:
>
> rtt min/avg/max/mdev = 474.422/504.416/628.508/35.366 ms
>
> Now the maximum is near the minimum, i.e. reasonable values.
>
> The limit for the loop in stage 2 has been set to 50ms because
> buffered_file runs a timer every 100ms.  If we miss that timer, we end up
> in trouble, so I used 100/2.
>
> I tried other values: 15ms (max_downtime/2, so it could be set by the
> user), but that gave too long a total time (~400 seconds).
>
> I tried bigger values, 75ms and 100ms, but with either of them we got
> stalls, sometimes as big as 1s, because we lose some timer runs and then
> the calculations are wrong.
>
> With this patch, the soft lockups are gone.
>
> Change calculation to exit live migration
> -----------------------------------------
>
> We spent too much time in ram_save_live(); the problem is the calculation
> of the number of dirty pages (ram_save_remaining()).  Instead of walking
> the bitmap each time we need the value, we now maintain the number of
> dirty pages, updating it each time we change a value in the bitmap.
>
> migration: ended after 151187 milliseconds
>
> Same total time.
>
> samples  %        image name          symbol name
>  365104  84.1659  qemu-system-x86_64  ram_save_block
>   32048   7.3879  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   16033   3.6960  qemu-system-x86_64  qemu_put_byte
>    3383   0.7799  qemu-system-x86_64  qemu_put_be32
>    3028   0.6980  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2174   0.5012  qemu-system-x86_64  qemu_put_be64
>    1953   0.4502  qemu-system-x86_64  ram_save_live
>    1408   0.3246  qemu-system-x86_64  qemu_file_rate_limit
>
> Time is spent in ram_save_block() as expected.
>
> rtt min/avg/max/mdev = 474.412/492.713/539.419/21.896 ms
>
> The standard deviation is still better than without this change.
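
A rough sketch of what that last change amounts to, with made-up names (not
the actual patch): count dirty pages on every 0 -> 1 and 1 -> 0 bit
transition, so that asking "how many dirty pages are left" becomes a counter
read instead of a walk over the whole bitmap.  For a 64GB guest the bitmap
covers roughly 16 million 4KB pages, so repeating that walk is expensive.

    /* Illustrative sketch only; in QEMU the counter would live next to the
     * migration dirty bitmap. */
    #include <stddef.h>
    #include <stdint.h>

    #define MIG_PAGES     (1 << 20)              /* example: 1M tracked pages */
    #define BITS_PER_WORD (8 * sizeof(unsigned long))

    static unsigned long dirty_bitmap[MIG_PAGES / BITS_PER_WORD];
    static uint64_t dirty_pages;                 /* maintained on every bit flip */

    static void mig_set_dirty(size_t page)
    {
        unsigned long mask = 1UL << (page % BITS_PER_WORD);
        unsigned long *word = &dirty_bitmap[page / BITS_PER_WORD];

        if (!(*word & mask)) {                   /* count only 0 -> 1 transitions */
            *word |= mask;
            dirty_pages++;
        }
    }

    static void mig_clear_dirty(size_t page)
    {
        unsigned long mask = 1UL << (page % BITS_PER_WORD);
        unsigned long *word = &dirty_bitmap[page / BITS_PER_WORD];

        if (*word & mask) {                      /* count only 1 -> 0 transitions */
            *word &= ~mask;
            dirty_pages--;
        }
    }

    static uint64_t mig_remaining(void)
    {
        return dirty_pages;                      /* O(1): no bitmap walk needed */
    }
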
>
>
> And now, with load on the guest!!!
> ----------------------------------
>
> I will only show the results without my patches applied and, at the end,
> with all of them applied (with load it takes more time to run the tests).
>
> The load is synthetic:
>
>    stress -c 2 -m 4 --vm-bytes 256M
>
> (2 CPU threads and 4 memory threads, each memory thread dirtying 256MB of RAM)
>
> Notice that we are dirtying too much memory to be able to migrate with the
> default downtime of 30ms.  What the migration should do is keep looping,
> but without stalls.  To get the migration to finish, I just kill the
> stress process after several iterations through all of memory.
>
> initial
> -------
>
> The same stalls as without load (the stalls are caused when it finds lots
> of contiguous zero pages).
>
> samples  %        image name          symbol name
> 2328320  52.9645  qemu-system-x86_64  cpu_physical_memory_reset_dirty
> 1504561  34.2257  qemu-system-x86_64  ram_save_live
>  382838   8.7088  qemu-system-x86_64  ram_save_block
>   52050   1.1840  qemu-system-x86_64  cpu_get_physical_page_desc
>   48975   1.1141  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>
> rtt min/avg/max/mdev = 474.428/21033.451/134818.933/38245.396 ms, pipe 135
>
> You can see that the values/results are similar to what we had before.
>
> with all patches
> ----------------
>
> No stalls; I stopped it after 438 seconds.
>
> samples  %        image name          symbol name
>  387722  56.4676  qemu-system-x86_64  ram_save_block
>  109500  15.9475  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   92328  13.4466  qemu-system-x86_64  cpu_get_physical_page_desc
>   43573   6.3459  qemu-system-x86_64  phys_page_find_alloc
>   18255   2.6586  qemu-system-x86_64  qemu_put_byte
>    3940   0.5738  qemu-system-x86_64  qemu_put_be32
>    3621   0.5274  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>    2591   0.3774  qemu-system-x86_64  ram_save_live
>
> And ping gives values similar to the unloaded case:
>
> rtt min/avg/max/mdev = 474.400/486.094/548.479/15.820 ms
>
> Note:
>
> - I tested a version of these patches/algorithms with 400GB guests on an
>   old qemu-kvm version (0.9.1, the one in RHEL5).  With that much memory,
>   the handling of the dirty bitmap is the thing that ends up causing
>   stalls; I will try to retest when I get access to the machines again.
>
>
> QEMU tree
> ---------
>
> original qemu
> -------------
>
> savevm: save live iterate section id 2 name ram took 296 milliseconds 47 times
>
> Stalls similar to qemu-kvm.
>
> migration: ended after 205938 milliseconds
>
> Similar total time.
>
> samples  %        image name          symbol name
> 2158149  72.3752  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>  382016  12.8112  qemu-system-x86_64  ram_save_live
>  367000  12.3076  qemu-system-x86_64  ram_save_block
>   18012   0.6040  qemu-system-x86_64  qemu_put_byte
>   10496   0.3520  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>    7366   0.2470  qemu-system-x86_64  qemu_get_ram_ptr
>
> Very bad ping times:
>
> rtt min/avg/max/mdev = 474.424/54575.554/159139.429/54473.043 ms, pipe 160
>
>
> with all patches applied (no load)
> ----------------------------------
>
> savevm: save live iterate section id 2 name ram took 109 milliseconds 1 times
>
> Only one mini-stall, and it is during stage 3 of savevm.
>
> migration: ended after 149529 milliseconds
>
> Similar total time (a bit faster, indeed).
>
> samples  %        image name          symbol name
>  366803  73.9172  qemu-system-x86_64  ram_save_block
>   31717   6.3915  qemu-system-x86_64  kvm_client_sync_dirty_bitmap
>   16489   3.3228  qemu-system-x86_64  qemu_put_byte
>    5512   1.1108  qemu-system-x86_64  main_loop_wait
>    4886   0.9846  qemu-system-x86_64  cpu_exec_all
>    3418   0.6888  qemu-system-x86_64  qemu_put_be32
>    3397   0.6846  qemu-system-x86_64  kvm_vcpu_ioctl
>    3334   0.6719  [vdso] (tgid:18656 range:0x7ffff7ffe000-0x7ffff7fff000)  [vdso] (tgid:18656 range:0x7ffff7ffe000-0x7ffff7fff000)
>    2913   0.5870  qemu-system-x86_64  cpu_physical_memory_reset_dirty
>
> The standard deviation is a bit worse than qemu-kvm, but nothing to write
> home about:
>
> rtt min/avg/max/mdev = 475.406/485.577/909.463/40.292 ms
>
> Juan Quintela (7):
>   Add spent time for migration
>   Add tracepoints for savevm section start/end
>   No need to iterate if we already are over the limit
>   Only TCG needs TLB handling
>   Only calculate expected_time for stage 2
>   Exit loop if we have been there too long
>   Maintaing number of dirty pages
>
>  arch_init.c      |   40 ++++++++++++++++++++++------------------
>  cpu-all.h        |    1 +
>  exec-obsolete.h  |    8 ++++++++
>  exec.c           |   33 +++++++++++++++++++++++----------
>  hmp.c            |    2 ++
>  migration.c      |   11 +++++++++++
>  migration.h      |    1 +
>  qapi-schema.json |   12 +++++++++---
>  savevm.c         |   11 +++++++++++
>  trace-events     |    6 ++++++
>  10 files changed, 94 insertions(+), 31 deletions(-)
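
One more sketch, for the "Exit loop if we have been there too long" patch
listed above (the change the measurements credit with removing the stalls).
The names and the clock helper here are illustrative only; the real code uses
QEMU's own clock and rate-limit plumbing.  The point is simply to bound one
pass of the stage-2 RAM loop to about 50ms, half of the 100ms buffered_file
timer period, so the main loop keeps running:

    /* Illustrative sketch only, not the actual patch. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    enum { MAX_STAGE2_TIME_MS = 50 };  /* half of the 100ms buffered_file timer */

    static int64_t clock_ms(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    /* True when the caller should leave the page-sending loop and let the
     * main loop (timers, monitor, vCPUs) run again. */
    static bool stage2_time_exceeded(int64_t loop_start_ms)
    {
        return clock_ms() - loop_start_ms > MAX_STAGE2_TIME_MS;
    }

    /* Intended use inside the stage-2 loop (caller names are hypothetical):
     *
     *     int64_t start = clock_ms();
     *     while (more_dirty_pages() && !rate_limit_hit()) {
     *         send_one_page();
     *         if (stage2_time_exceeded(start)) {
     *             break;   // come back on the next timer tick
     *         }
     *     }
     */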