[Qemu-devel] [PATCH 00/10] Fix migration with lots of memory

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory
@ 2010-11-23 23:02 Juan Quintela
  2010-11-23 23:02 ` [Qemu-devel] [PATCH 01/10] Add spent time to migration Juan Quintela
                   ` (9 more replies)
  0 siblings, 10 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:02 UTC (permalink / raw)
  To: qemu-devel

Hi

Executive Summary
-----------------

This series of patches fix migration with lots of memory.  With them stalls
are removed, and we honored max_dowtime.
I also add infrastructure to measure what is happening during migration
(#define DEBUG_MIGRATION and DEBUG_SAVEVM).

Migration is broken at the momment in qemu tree, Michael patch is needed to
fix virtio migration.  Measurements are given for qemu-kvm tree.  At the end, some measurements
with qemu tree.

Long Version with measurements (for those that like numbers O:-)
------------------------------

8 vCPUS and 64GB RAM, a RHEL5 guest that is completelly idle

initial
-------

   savevm: save live iterate section id 3 name ram took 3266 milliseconds 46 times

We have 46 stalls, and missed the 100ms deadline 46 times.
stalls took around 3.5 and 3.6 seconds each.

   savevm: save devices took 1 milliseconds

if you had any doubt (rest of devices, not RAM) took less than 1ms, so
we don't care for now to optimize them.

   migration: ended after 207411 milliseconds

total migration took 207 seconds for this guest

samples  %        image name               symbol name
2161431  72.8297  qemu-system-x86_64       cpu_physical_memory_reset_dirty
379416   12.7845  qemu-system-x86_64       ram_save_live
367880   12.3958  qemu-system-x86_64       ram_save_block
16647     0.5609  qemu-system-x86_64       qemu_put_byte
10416     0.3510  qemu-system-x86_64       kvm_client_sync_dirty_bitmap
9013      0.3037  qemu-system-x86_64       qemu_put_be32

Clearly, we are spending too much time on cpu_physical_memory_reset_dirty.

ping results during the migration.

rtt min/avg/max/mdev = 474.395/39772.087/151843.178/55413.633 ms, pipe 152

You can see that the mean and maximun values are quite big.

We got in the guests the dreade: CPU softlookup for 10s

No need to iterate if we already are over the limit
---------------------------------------------------

   Numbers similar to previous ones.

KVM don't care about TLB handling
---------------------------------

   savevm: save livne iterate section id 3 name ram took 466 milliseconds 56 times

56 stalls, but much smaller, betweenn 0.5 and 1.4 seconds

    migration: ended after 115949 milliseconds

total time has improved a lot. 115 seconds.

samples  %        image name               symbol name
431530   52.1152  qemu-system-x86_64       ram_save_live
355568   42.9414  qemu-system-x86_64       ram_save_block
14446     1.7446  qemu-system-x86_64       qemu_put_byte
11856     1.4318  qemu-system-x86_64       kvm_client_sync_dirty_bitmap
3281      0.3962  qemu-system-x86_64       qemu_put_be32
2426      0.2930  qemu-system-x86_64       cpu_physical_memory_reset_dirty
2180      0.2633  qemu-system-x86_64       qemu_put_be64

notice how cpu_physical_memory_dirty() use much less time.

rtt min/avg/max/mdev = 474.438/1529.387/15578.055/2595.186 ms, pipe 16

ping values from outside to the guest have improved a bit, but still
bad.

Exit loop if we have been there too long
----------------------------------------

not a single stall bigger than 100ms

   migration: ended after 157511 milliseconds

not as good time as previous one, but we have removed the stalls.

samples  %        image name               symbol name
1104546  71.8260  qemu-system-x86_64       ram_save_live
370472   24.0909  qemu-system-x86_64       ram_save_block
30419     1.9781  qemu-system-x86_64       kvm_client_sync_dirty_bitmap
16252     1.0568  qemu-system-x86_64       qemu_put_byte
3400      0.2211  qemu-system-x86_64       qemu_put_be32
2657      0.1728  qemu-system-x86_64       cpu_physical_memory_reset_dirty
2206      0.1435  qemu-system-x86_64       qemu_put_be64
1559      0.1014  qemu-system-x86_64       qemu_file_rate_limit


You can see that ping times are improving
  rtt min/avg/max/mdev = 474.422/504.416/628.508/35.366 ms

now the maximun is near the minimum, in reasonable values.

The limit in the loop in stage loop has been put into 50ms because
buffered_file run a timer each 100ms.  If we miss that timer, we ended
having trouble.  So, I put 100/2.

I tried other values: 15ms (max_downtime/2, so it could be set by the
user), but gave too much total time (~400seconds).

I tried bigger values, 75ms and 100ms, but with any of them we got
stalls, some times as big as 1s, as we loss some timer run, and then
calculations are wrong.

With this patch, the softlookups are gone.

Change calculation to exit live migration
-----------------------------------------

we spent too much time on ram_save_live(), the problem is the
calculation of number of dirty pages (ram_save_remaining()).  Instead
of walking the bitmap each time that we need the value, we just
maintain the number of dirty pages each time that we change one value
in the bitmap.

   migration: ended after 151187 milliseconds

same total time.

samples  %        image name               symbol name
365104   84.1659  qemu-system-x86_64       ram_save_block
32048     7.3879  qemu-system-x86_64       kvm_client_sync_dirty_bitmap
16033     3.6960  qemu-system-x86_64       qemu_put_byte
3383      0.7799  qemu-system-x86_64       qemu_put_be32
3028      0.6980  qemu-system-x86_64       cpu_physical_memory_reset_dirty
2174      0.5012  qemu-system-x86_64       qemu_put_be64
1953      0.4502  qemu-system-x86_64       ram_save_live
1408      0.3246  qemu-system-x86_64       qemu_file_rate_limit

time is spent in ram_save_block() as expected.

rtt min/avg/max/mdev = 474.412/492.713/539.419/21.896 ms

std deviation is still better than without this.


and now, with load on the guest!!!
----------------------------------

will show only without my patches applied, and at the end (as with
load it takes more time to run the tests).

load is synthetic:

 stress -c 2 -m 4 --vm-bytes 256M

(2 cpu threads and two memory threads dirtying each 256MB RAM)

Notice that we are dirtying too much memory to be able to migrate with
the default downtime of 30ms.  What the migration should do is loop over
but without having stalls.  To get the migration ending, I just kill the
stress process after several iterations through all memory.

initial
-------

same stalls that without load (stalls are caused when it finds lots of
contiguous zero pages).


samples  %        image name               symbol name
2328320  52.9645  qemu-system-x86_64       cpu_physical_memory_reset_dirty
1504561  34.2257  qemu-system-x86_64       ram_save_live
382838    8.7088  qemu-system-x86_64       ram_save_block
52050     1.1840  qemu-system-x86_64       cpu_get_physical_page_desc
48975     1.1141  qemu-system-x86_64       kvm_client_sync_dirty_bitmap

rtt min/avg/max/mdev = 474.428/21033.451/134818.933/38245.396 ms, pipe 135

You can see that values/results are similar to what we had.

with all patches
----------------

no stalls, I stopped it after 438 seconds

samples  %        image name               symbol name
387722   56.4676  qemu-system-x86_64       ram_save_block
109500   15.9475  qemu-system-x86_64       kvm_client_sync_dirty_bitmap
92328    13.4466  qemu-system-x86_64       cpu_get_physical_page_desc
43573     6.3459  qemu-system-x86_64       phys_page_find_alloc
18255     2.6586  qemu-system-x86_64       qemu_put_byte
3940      0.5738  qemu-system-x86_64       qemu_put_be32
3621      0.5274  qemu-system-x86_64       cpu_physical_memory_reset_dirty
2591      0.3774  qemu-system-x86_64       ram_save_live

and ping gives similar values to unload one.

rtt min/avg/max/mdev = 474.400/486.094/548.479/15.820 ms

Note:

- I tested a version of this patches/algorithms with 400GB guests with
  an old qemu-kvm version (0.9.1, the one in RHEL5.  with so many
  memory the handling of the dirty bitmap is the thing that end
  causing stalls, will try to retest when I got access to the machines
  again).


QEMU tree
---------

original qemu
-------------

   savevm: save live iterate section id 2 name ram took 296 milliseconds 47 times

stalls similar to qemu-kvm.

  migration: ended after 205938 milliseconds

similar total time.

samples  %        image name               symbol name
2158149  72.3752  qemu-system-x86_64       cpu_physical_memory_reset_dirty
382016   12.8112  qemu-system-x86_64       ram_save_live
367000   12.3076  qemu-system-x86_64       ram_save_block
18012     0.6040  qemu-system-x86_64       qemu_put_byte
10496     0.3520  qemu-system-x86_64       kvm_client_sync_dirty_bitmap
7366      0.2470  qemu-system-x86_64       qemu_get_ram_ptr

very bad ping times
   rtt min/avg/max/mdev = 474.424/54575.554/159139.429/54473.043 ms, pipe 160


with all patches applied (no load)
----------------------------------

   savevm: save live iterate section id 2 name ram took 109 milliseconds 1 times

only one mini-stall, it is during stage 3 of savevm.

   migration: ended after 149529 milliseconds

similar time (a bit faster indeed)

samples  %        image name               symbol name
366803   73.9172  qemu-system-x86_64       ram_save_block
31717     6.3915  qemu-system-x86_64       kvm_client_sync_dirty_bitmap
16489     3.3228  qemu-system-x86_64       qemu_put_byte
5512      1.1108  qemu-system-x86_64       main_loop_wait
4886      0.9846  qemu-system-x86_64       cpu_exec_all
3418      0.6888  qemu-system-x86_64       qemu_put_be32
3397      0.6846  qemu-system-x86_64       kvm_vcpu_ioctl
3334      0.6719  [vdso] (tgid:18656 range:0x7ffff7ffe000-0x7ffff7fff000) [vdso] (tgid:18656 range:0x7ffff7ffe000-0x7ffff7fff000)
2913      0.5870  qemu-system-x86_64       cpu_physical_memory_reset_dirty

std deviation is a bit worse than qemu-kvm, but nothing to write home
   rtt min/avg/max/mdev = 475.406/485.577/909.463/40.292 ms




Juan Quintela (10):
  Add spent time to migration
  Add buffered_file_internal constant
  Add printf debug to savevm
  No need to iterate if we already are over the limit
  KVM don't care about TLB handling
  Only calculate expected_time for stage 2
  ram_save_remaining() returns an uint64_t
  Count nanoseconds with uint64_t not doubles
  Exit loop if we have been there too long
  Maintaing number of dirty pages

 arch_init.c     |   50 ++++++++++++++++++++++++++++----------------------
 buffered_file.c |    6 ++++--
 buffered_file.h |    2 ++
 cpu-all.h       |    7 +++++++
 exec.c          |    4 ++++
 migration.c     |   13 +++++++++++++
 savevm.c        |   51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 7 files changed, 107 insertions(+), 26 deletions(-)

-- 
1.7.3.2

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 01/10] Add spent time to migration
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
@ 2010-11-23 23:02 ` Juan Quintela
  2010-11-23 23:02 ` [Qemu-devel] [PATCH 02/10] Add buffered_file_internal constant Juan Quintela
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

When printing debug information for migration, print total time spent.

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 migration.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/migration.c b/migration.c
index 9ee8b17..4786406 100644
--- a/migration.c
+++ b/migration.c
@@ -26,9 +26,16 @@
 #ifdef DEBUG_MIGRATION
 #define DPRINTF(fmt, ...) \
     do { printf("migration: " fmt, ## __VA_ARGS__); } while (0)
+static int64_t start, stop;
+#define START_MIGRATION_CLOCK()	do { start = qemu_get_clock(rt_clock); } while (0)
+#define STOP_MIGRATION_CLOCK() \
+	do { stop = qemu_get_clock(rt_clock) - start; \
+	} while (0)
 #else
 #define DPRINTF(fmt, ...) \
     do { } while (0)
+#define START_MIGRATION_CLOCK()	do {} while (0)
+#define STOP_MIGRATION_CLOCK()	do {} while (0)
 #endif

 /* Migration speed throttling */
@@ -88,6 +95,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
         return -1;
     }

+    START_MIGRATION_CLOCK();
     if (strstart(uri, "tcp:", &p)) {
         s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
                                          blk, inc);
@@ -127,6 +135,8 @@ int do_migrate_cancel(Monitor *mon, const QDict *qdict, QObject **ret_data)
     if (s)
         s->cancel(s);

+    STOP_MIGRATION_CLOCK();
+    DPRINTF("canceled after %lu milliseconds\n", stop);
     return 0;
 }

@@ -378,6 +388,9 @@ void migrate_fd_put_ready(void *opaque)
         } else {
             state = MIG_STATE_COMPLETED;
         }
+	STOP_MIGRATION_CLOCK();
+	DPRINTF("ended after %lu milliseconds\n", stop);
+
         if (migrate_fd_cleanup(s) < 0) {
             if (old_vm_running) {
                 vm_start();
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 02/10] Add buffered_file_internal constant
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
  2010-11-23 23:02 ` [Qemu-devel] [PATCH 01/10] Add spent time to migration Juan Quintela
@ 2010-11-23 23:02 ` Juan Quintela
  2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 03/10] Add printf debug to savevm Juan Quintela
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

This time is each time that buffered_file ticks happen

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 buffered_file.c |    6 ++++--
 buffered_file.h |    2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 1836e7e..1f492e6 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -20,6 +20,8 @@

 //#define DEBUG_BUFFERED_FILE

+const int buffered_file_interval = 100;
+
 typedef struct QEMUFileBuffered
 {
     BufferedPutFunc *put_buffer;
@@ -235,7 +237,7 @@ static void buffered_rate_tick(void *opaque)
         return;
     }

-    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + 100);
+    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + buffered_file_interval);

     if (s->freeze_output)
         return;
@@ -273,7 +275,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,

     s->timer = qemu_new_timer(rt_clock, buffered_rate_tick, s);

-    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + 100);
+    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + buffered_file_interval);

     return s->file;
 }
diff --git a/buffered_file.h b/buffered_file.h
index 98d358b..a728316 100644
--- a/buffered_file.h
+++ b/buffered_file.h
@@ -21,6 +21,8 @@ typedef void (BufferedPutReadyFunc)(void *opaque);
 typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
 typedef int (BufferedCloseFunc)(void *opaque);

+extern const int buffered_file_interval;
+
 QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit,
                                   BufferedPutFunc *put_buffer,
                                   BufferedPutReadyFunc *put_ready,
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-23 23:02 ` [Qemu-devel] [PATCH 02/10] Add buffered_file_internal constant Juan Quintela
@ 2010-11-24 10:40   ` Michael S. Tsirkin
  2010-11-24 10:52     ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-24 10:40 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Juan Quintela

On Wed, Nov 24, 2010 at 12:02:59AM +0100, Juan Quintela wrote:
> From: Juan Quintela <quintela@trasno.org>
> 
> This time is each time that buffered_file ticks happen
> 
> Signed-off-by: Juan Quintela <quintela@trasno.org>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  buffered_file.c |    6 ++++--
>  buffered_file.h |    2 ++
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/buffered_file.c b/buffered_file.c
> index 1836e7e..1f492e6 100644
> --- a/buffered_file.c
> +++ b/buffered_file.c
> @@ -20,6 +20,8 @@
> 
>  //#define DEBUG_BUFFERED_FILE
> 
> +const int buffered_file_interval = 100;
> +
>  typedef struct QEMUFileBuffered
>  {
>      BufferedPutFunc *put_buffer;
> @@ -235,7 +237,7 @@ static void buffered_rate_tick(void *opaque)
>          return;
>      }
> 
> -    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + 100);
> +    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + buffered_file_interval);
> 
>      if (s->freeze_output)
>          return;
> @@ -273,7 +275,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
> 
>      s->timer = qemu_new_timer(rt_clock, buffered_rate_tick, s);
> 
> -    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + 100);
> +    qemu_mod_timer(s->timer, qemu_get_clock(rt_clock) + buffered_file_interval);
> 
>      return s->file;
>  }
> diff --git a/buffered_file.h b/buffered_file.h
> index 98d358b..a728316 100644
> --- a/buffered_file.h
> +++ b/buffered_file.h
> @@ -21,6 +21,8 @@ typedef void (BufferedPutReadyFunc)(void *opaque);
>  typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
>  typedef int (BufferedCloseFunc)(void *opaque);
> 
> +extern const int buffered_file_interval;
> +

This shouldn't be exported, IMO.

>  QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit,
>                                    BufferedPutFunc *put_buffer,
>                                    BufferedPutReadyFunc *put_ready,
> -- 
> 1.7.3.2
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
@ 2010-11-24 10:52     ` Juan Quintela
  2010-11-24 11:04       ` Michael S. Tsirkin
       [not found]       ` <4CF46012.2060804@codemonkey.ws>
  0 siblings, 2 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-24 10:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Nov 24, 2010 at 12:02:59AM +0100, Juan Quintela wrote:
>> From: Juan Quintela <quintela@trasno.org>

>> diff --git a/buffered_file.h b/buffered_file.h
>> index 98d358b..a728316 100644
>> --- a/buffered_file.h
>> +++ b/buffered_file.h
>> @@ -21,6 +21,8 @@ typedef void (BufferedPutReadyFunc)(void *opaque);
>>  typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
>>  typedef int (BufferedCloseFunc)(void *opaque);
>> 
>> +extern const int buffered_file_interval;
>> +
>
> This shouldn't be exported, IMO.

What do you want?  an accessor function?  Notice that it is a constant.
We need the value in other places, see the last patch.

Otherwise, I have to pick random numbers like ... 50ms.

Later, Juan.


>
>>  QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit,
>>                                    BufferedPutFunc *put_buffer,
>>                                    BufferedPutReadyFunc *put_ready,
>> -- 
>> 1.7.3.2
>> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-24 10:52     ` Juan Quintela
@ 2010-11-24 11:04       ` Michael S. Tsirkin
  2010-11-24 11:13         ` Juan Quintela
       [not found]       ` <4CF46012.2060804@codemonkey.ws>
  1 sibling, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-24 11:04 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel

On Wed, Nov 24, 2010 at 11:52:25AM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, Nov 24, 2010 at 12:02:59AM +0100, Juan Quintela wrote:
> >> From: Juan Quintela <quintela@trasno.org>
> 
> >> diff --git a/buffered_file.h b/buffered_file.h
> >> index 98d358b..a728316 100644
> >> --- a/buffered_file.h
> >> +++ b/buffered_file.h
> >> @@ -21,6 +21,8 @@ typedef void (BufferedPutReadyFunc)(void *opaque);
> >>  typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
> >>  typedef int (BufferedCloseFunc)(void *opaque);
> >> 
> >> +extern const int buffered_file_interval;
> >> +
> >
> > This shouldn't be exported, IMO.
> 
> What do you want?  an accessor function?

I mean an abstraction at qemu_file_ level.

>  Notice that it is a constant.
> We need the value in other places, see the last patch.
> 
> Otherwise, I have to pick random numbers like ... 50ms.
> 
> Later, Juan.

If you need a random number, use a random number :)

> 
> >
> >>  QEMUFile *qemu_fopen_ops_buffered(void *opaque, size_t xfer_limit,
> >>                                    BufferedPutFunc *put_buffer,
> >>                                    BufferedPutReadyFunc *put_ready,
> >> -- 
> >> 1.7.3.2
> >> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-24 11:04       ` Michael S. Tsirkin
@ 2010-11-24 11:13         ` Juan Quintela
  2010-11-24 11:19           ` Michael S. Tsirkin
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-24 11:13 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Nov 24, 2010 at 11:52:25AM +0100, Juan Quintela wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> > On Wed, Nov 24, 2010 at 12:02:59AM +0100, Juan Quintela wrote:
>> >> From: Juan Quintela <quintela@trasno.org>
>> 
>> >> diff --git a/buffered_file.h b/buffered_file.h
>> >> index 98d358b..a728316 100644
>> >> --- a/buffered_file.h
>> >> +++ b/buffered_file.h
>> >> @@ -21,6 +21,8 @@ typedef void (BufferedPutReadyFunc)(void *opaque);
>> >>  typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
>> >>  typedef int (BufferedCloseFunc)(void *opaque);
>> >> 
>> >> +extern const int buffered_file_interval;
>> >> +
>> >
>> > This shouldn't be exported, IMO.
>> 
>> What do you want?  an accessor function?
>
> I mean an abstraction at qemu_file_ level.
>
>>  Notice that it is a constant.
>> We need the value in other places, see the last patch.
>> 
>> Otherwise, I have to pick random numbers like ... 50ms.
>> 
>> Later, Juan.
>
> If you need a random number, use a random number :)

It is not random, it is the bigger number that lets the timer to run at
each 100ms O:-)

I tried 100ms and 75ms, they got better throughput, but I got still
stalls of 1.5s.

I could do some search between 50 and 75ms, but the advantage in
bandwidth is not so big to complicate things.  (150 vs 125s for total
migration time or so, and system is quite irresponsive with the bigger
value).

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-24 11:13         ` Juan Quintela
@ 2010-11-24 11:19           ` Michael S. Tsirkin
  0 siblings, 0 replies; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-24 11:19 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel

On Wed, Nov 24, 2010 at 12:13:26PM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, Nov 24, 2010 at 11:52:25AM +0100, Juan Quintela wrote:
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> > On Wed, Nov 24, 2010 at 12:02:59AM +0100, Juan Quintela wrote:
> >> >> From: Juan Quintela <quintela@trasno.org>
> >> 
> >> >> diff --git a/buffered_file.h b/buffered_file.h
> >> >> index 98d358b..a728316 100644
> >> >> --- a/buffered_file.h
> >> >> +++ b/buffered_file.h
> >> >> @@ -21,6 +21,8 @@ typedef void (BufferedPutReadyFunc)(void *opaque);
> >> >>  typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
> >> >>  typedef int (BufferedCloseFunc)(void *opaque);
> >> >> 
> >> >> +extern const int buffered_file_interval;
> >> >> +
> >> >
> >> > This shouldn't be exported, IMO.
> >> 
> >> What do you want?  an accessor function?
> >
> > I mean an abstraction at qemu_file_ level.
> >
> >>  Notice that it is a constant.
> >> We need the value in other places, see the last patch.
> >> 
> >> Otherwise, I have to pick random numbers like ... 50ms.
> >> 
> >> Later, Juan.
> >
> > If you need a random number, use a random number :)
> 
> It is not random, it is the bigger number that lets the timer to run at
> each 100ms O:-)

The fact that buffered_file uses a timer internally is an implementation
detail: arch_init uses the qemu file abstraction.

> I tried 100ms and 75ms, they got better throughput, but I got still
> stalls of 1.5s.
> 
> I could do some search between 50 and 75ms, but the advantage in
> bandwidth is not so big to complicate things.  (150 vs 125s for total
> migration time or so, and system is quite irresponsive with the bigger
> value).
> 
> Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

[parent not found: <4CF46012.2060804@codemonkey.ws>]

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
       [not found]       ` <4CF46012.2060804@codemonkey.ws>
@ 2010-11-30 11:56         ` Juan Quintela
  2010-11-30 14:02           ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 11:56 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Michael S. Tsirkin

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/24/2010 04:52 AM, Juan Quintela wrote:
>> "Michael S. Tsirkin"<mst@redhat.com>  wrote:
>>    
>>> On Wed, Nov 24, 2010 at 12:02:59AM +0100, Juan Quintela wrote:
>>>      
>>>> From: Juan Quintela<quintela@trasno.org>
>>>>        
>>    
>>>> diff --git a/buffered_file.h b/buffered_file.h
>>>> index 98d358b..a728316 100644
>>>> --- a/buffered_file.h
>>>> +++ b/buffered_file.h
>>>> @@ -21,6 +21,8 @@ typedef void (BufferedPutReadyFunc)(void *opaque);
>>>>   typedef void (BufferedWaitForUnfreezeFunc)(void *opaque);
>>>>   typedef int (BufferedCloseFunc)(void *opaque);
>>>>
>>>> +extern const int buffered_file_interval;
>>>> +
>>>>        
>>> This shouldn't be exported, IMO.
>>>      
>> What do you want?  an accessor function?  Notice that it is a constant.
>> We need the value in other places, see the last patch.
>>    
>
> That one's just as wrong as this one.  TBH, this whole series is
> fundamentally the wrong approach because it's all ad hoc heuristics
> benchmarked against one workload.

No, I benchmarked against two workloads:
a- idle guest (because it was faster to test)
b- busy guest (each test takes forever, that is the reason that I tested
last).

So, I don't agree with that.

> There are three fundamental problems: 1) kvm.ko dirty bit tracking
> doesn't scale

Fully agree, but this patch don't took that.

> 2) we lose flow control information because of the
> multiple levels of buffering which means we move more data than we
> should move

Fully agree here, but this is a "massive change" to fix it correctly.

> 3) migration prevents a guest from executing the device
> model because of qemu_mutex.

This is a different problem.

> Those are the problems to fix.

This still don't fix the stalls on the main_loop.

So, you are telling me, there are this list of problems that you need to
fix.  They are not enough to fix the problem, and their imply massive
changes.

In the middle time, everybody in stable and 0.14 is not going to be able
to use migration with more than 2GB/4GB guest.

>  Sprinkling the code with returns in
> semi-random places because it benchmarked well for one particular test
> case is something we'll deeply regret down the road.

This was mean :(

There are two returns and one heuristic.

- return a) we try to migrate when we know that there is no space,
  obvious optimazation/bug (deppends on how to look at it).

- return b) we don't need to handle TLB bitmap code for kvm.  I fully
  agree that we need to split the bitmaps in something more sensible,
  but change is quite invasible, and simple fix works for the while.

- heuristic:  if you really think that an io_handler should be able to
  stall the main loop for almost 4 seconds, sorry, I don't agree.
  Again, we can fix this much better, but it needs lots of changes:
   * I asked you if there is a "maximum" value that one io_handler can
     hog the main loop.  Answer: dunno/deppends.
   * Long term: migration should be its own thread, have I told thread?
     This is qemu, no thread.
   * qemu_mutex: this is the holy grail, if we drop qemu_mutex in
     ram_save_live(), we left the guest cpu's to continue running.
     But, and this is a big but, we still stuck the main_loop for 4
     seconds, so this solution is at least partial.

And that is it.  To improve things, I receive complains left and right
that exporting buffered_file_period/timeout is BAD, very BAD.  Because
that exposes buffered_file.c internal state.

But I am given the suggestion that I should just create a
buffered_file_write_nop() that just increases the number of bytes
transferred for normal pages.  I agree that this "could" also work, but
that is indeed worse in the sense that we are exposing yet more
internals of buffered_file.c.  In the 1st case, we are only exposing the
periof of a timer, in the second, we are hijaking how
qemu_file_rate_limit() works + how we account for written memory + how
it calculates the ammount of staff that it have sent, and with this we
"knownly" lie about how much stuff we have write.

How this can be "cleaner" than a timeout of 50ms is beyond me.

On the other hand, this is the typical case where you need a lot of
testing for each change that you did.  I thought several times that I
had found the "guilty" bit for the stalls, and no luck, there was yet
another problem.

I also thought at points, ok, I can now drop the previous attempts, and
no, that didn't work either.  This was the minimal set of patches able
to fix the stalls.

This is just very depressing.

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 11:56         ` Juan Quintela
@ 2010-11-30 14:02           ` Anthony Liguori
  2010-11-30 14:11             ` Michael S. Tsirkin
  2010-11-30 15:40             ` Juan Quintela
  0 siblings, 2 replies; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 14:02 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Michael S. Tsirkin

On 11/30/2010 05:56 AM, Juan Quintela wrote:
> No, I benchmarked against two workloads:
> a- idle guest (because it was faster to test)
> b- busy guest (each test takes forever, that is the reason that I tested
> last).
>
> So, I don't agree with that.
>    

But in both cases, it's a large memory guest where the RSS size is <<< 
than the allocated memory size.  This is simply an unrealistic 
scenario.  Migrating immediately after launch may be common among 
developers but users typically run useful workloads in their guests and 
run migration after the guest has been running for quite some time.

So the scenario I'm much more interested in, is trying to migrate a 
400GB guest after the guest has been doing useful work and has brought 
in most of it's RSS.  Also with a box that big, you're going to be on 
10gbit.

That's going to introduce a whole other set of problems that this series 
is potentially just going to exacerbate even more.

>> There are three fundamental problems: 1) kvm.ko dirty bit tracking
>> doesn't scale
>>      
> Fully agree, but this patch don't took that.
>
>    
>> 2) we lose flow control information because of the
>> multiple levels of buffering which means we move more data than we
>> should move
>>      
> Fully agree here, but this is a "massive change" to fix it correctly.
>    

It's really not massive.

>> 3) migration prevents a guest from executing the device
>> model because of qemu_mutex.
>>      
> This is a different problem.
>    

No, this is really the fundamental one.

>> Those are the problems to fix.
>>      
> This still don't fix the stalls on the main_loop.
>
> So, you are telling me, there are this list of problems that you need to
> fix.  They are not enough to fix the problem, and their imply massive
> changes.
>
> In the middle time, everybody in stable and 0.14 is not going to be able
> to use migration with more than 2GB/4GB guest.
>    

That's simply not true.  You started this series with a statement that 
migration is broken.  It's not, it works perfectly fine.  Migration with 
8GB guests work perfectly fine.

You've identified a corner case for which we have suboptimal behavior, 
and are now declaring that migration is "totally broken".

>>   Sprinkling the code with returns in
>> semi-random places because it benchmarked well for one particular test
>> case is something we'll deeply regret down the road.
>>      
> This was mean :(
>    

It wasn't intended to be mean but it is the truth.  We need to approach 
these sort of problems in a more systematic way.  Systematic means 
identifying what the fundamental problems are and fixing them in a 
proper way.

Throwing a magic number in the iteration path after which we let the 
main loop run for a little bit is simply not a solution.  You're still 
going to get main loop starvation.

> There are two returns and one heuristic.
>
> - return a) we try to migrate when we know that there is no space,
>    obvious optimazation/bug (deppends on how to look at it).
>
> - return b) we don't need to handle TLB bitmap code for kvm.  I fully
>    agree that we need to split the bitmaps in something more sensible,
>    but change is quite invasible, and simple fix works for the while.
>
> - heuristic:  if you really think that an io_handler should be able to
>    stall the main loop for almost 4 seconds, sorry, I don't agree.
>    

But fundamentally, why would an iteration of migration take 4 seconds?  
This is really my fundamental object, the migration loop should, at most 
take as much time as it takes to fill up an empty socket buffer or until 
it hits the bandwidth limit.

The bandwidth limit offers a fine-grain control over exactly how long it 
should take.

If we're burning excess CPU walking a 100MB bitmap, then let's fix that 
problem.  Stopping every 1MB worth of the bitmap to do other work just 
papers over the real problem (that we're walking 100MB bitmap).

>    Again, we can fix this much better, but it needs lots of changes:
>     * I asked you if there is a "maximum" value that one io_handler can
>       hog the main loop.  Answer: dunno/deppends.
>     * Long term: migration should be its own thread, have I told thread?
>       This is qemu, no thread.
>     * qemu_mutex: this is the holy grail, if we drop qemu_mutex in
>       ram_save_live(), we left the guest cpu's to continue running.
>       But, and this is a big but, we still stuck the main_loop for 4
>       seconds, so this solution is at least partial.
>    

It really shouldn't be that hard at all to add a multi-level tree to 
kvm.ko and it fixes a bunch of problems at once.  Likewise, accounting 
zero pages is very simple and it makes bandwidth limiting more effective.

> And that is it.  To improve things, I receive complains left and right
> that exporting buffered_file_period/timeout is BAD, very BAD.  Because
> that exposes buffered_file.c internal state.
>
> But I am given the suggestion that I should just create a
> buffered_file_write_nop() that just increases the number of bytes
> transferred for normal pages.  I agree that this "could" also work, but
> that is indeed worse in the sense that we are exposing yet more
> internals of buffered_file.c.  In the 1st case, we are only exposing the
> periof of a timer, in the second, we are hijaking how
> qemu_file_rate_limit() works + how we account for written memory + how
> it calculates the ammount of staff that it have sent, and with this we
> "knownly" lie about how much stuff we have write.
>
> How this can be "cleaner" than a timeout of 50ms is beyond me.
>    

If the migration loop is running for 50ms than we've already failed if 
the user requests a 30ms downtime for migration.  The 50ms timeout means 
that a VCPU can be blocked for 50ms at a time.

Regards,

Anthony Liguori

> On the other hand, this is the typical case where you need a lot of
> testing for each change that you did.  I thought several times that I
> had found the "guilty" bit for the stalls, and no luck, there was yet
> another problem.
>
> I also thought at points, ok, I can now drop the previous attempts, and
> no, that didn't work either.  This was the minimal set of patches able
> to fix the stalls.
>
> This is just very depressing.
>
> Later, Juan.
>    

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 14:02           ` Anthony Liguori
@ 2010-11-30 14:11             ` Michael S. Tsirkin
  2010-11-30 14:22               ` Anthony Liguori
  2010-11-30 15:40             ` Juan Quintela
  1 sibling, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-30 14:11 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Juan Quintela

On Tue, Nov 30, 2010 at 08:02:56AM -0600, Anthony Liguori wrote:
> If we're burning excess CPU walking a 100MB bitmap, then let's fix
> that problem.  Stopping every 1MB worth of the bitmap to do other
> work just papers over the real problem (that we're walking 100MB
> bitmap).

Just using a bit per page will already make the problem 8x smaller,
won't it?  Is that enough to avoid the need for a new kernel interface?

-- 
MST

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 14:11             ` Michael S. Tsirkin
@ 2010-11-30 14:22               ` Anthony Liguori
  0 siblings, 0 replies; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 14:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, Juan Quintela

On 11/30/2010 08:11 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 30, 2010 at 08:02:56AM -0600, Anthony Liguori wrote:
>    
>> If we're burning excess CPU walking a 100MB bitmap, then let's fix
>> that problem.  Stopping every 1MB worth of the bitmap to do other
>> work just papers over the real problem (that we're walking 100MB
>> bitmap).
>>      
> Just using a bit per page will already make the problem 8x smaller,
> won't it?  Is that enough to avoid the need for a new kernel interface?
>    

I don't know, but it's definitely a good thing to try.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 14:02           ` Anthony Liguori
  2010-11-30 14:11             ` Michael S. Tsirkin
@ 2010-11-30 15:40             ` Juan Quintela
  2010-11-30 16:10               ` Michael S. Tsirkin
  1 sibling, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 15:40 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Michael S. Tsirkin

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/30/2010 05:56 AM, Juan Quintela wrote:
>> No, I benchmarked against two workloads:
>> a- idle guest (because it was faster to test)
>> b- busy guest (each test takes forever, that is the reason that I tested
>> last).
>>
>> So, I don't agree with that.
>>    
>
> But in both cases, it's a large memory guest where the RSS size is <<<
> than the allocated memory size.  This is simply an unrealistic
> scenario.  Migrating immediately after launch may be common among
> developers but users typically run useful workloads in their guests
> and run migration after the guest has been running for quite some
> time.
>
> So the scenario I'm much more interested in, is trying to migrate a
> 400GB guest after the guest has been doing useful work and has brought
> in most of it's RSS.  Also with a box that big, you're going to be on
> 10gbit.
>
> That's going to introduce a whole other set of problems that this
> series is potentially just going to exacerbate even more.
>
>>> There are three fundamental problems: 1) kvm.ko dirty bit tracking
>>> doesn't scale
>>>      
>> Fully agree, but this patch don't took that.
>>
>>    
>>> 2) we lose flow control information because of the
>>> multiple levels of buffering which means we move more data than we
>>> should move
>>>      
>> Fully agree here, but this is a "massive change" to fix it correctly.
>>    
>
> It's really not massive.
>
>>> 3) migration prevents a guest from executing the device
>>> model because of qemu_mutex.
>>>      
>> This is a different problem.
>>    
>
> No, this is really the fundamental one.

We have more problems on the main_loop.

>>> Those are the problems to fix.
>>>      
>> This still don't fix the stalls on the main_loop.
>>
>> So, you are telling me, there are this list of problems that you need to
>> fix.  They are not enough to fix the problem, and their imply massive
>> changes.
>>
>> In the middle time, everybody in stable and 0.14 is not going to be able
>> to use migration with more than 2GB/4GB guest.
>>    
>
> That's simply not true.  You started this series with a statement that
> migration is broken.

Well, it deppends who you ask.

>   It's not, it works perfectly fine.  Migration
> with 8GB guests work perfectly fine.

It dont' work perfectly fine.  If you have a migrate_maximum_downtime of
30ms and we are stuck for 1s several times, it is broken on my book.
This was an idle guest.

Now move to having a loaded guest (more than we can migrate).  And then
we get stalls of almost a minute where nothing happens.

That is a "very strange" definition of it works perfectly.  In my book
it is nearer the "it is completely broken".

> You've identified a corner case for which we have suboptimal behavior,
> and are now declaring that migration is "totally broken".

I think it is not a corner case, but it deppendes of what your "normal"
case is.

>>>   Sprinkling the code with returns in
>>> semi-random places because it benchmarked well for one particular test
>>> case is something we'll deeply regret down the road.
>>>      
>> This was mean :(
>>    
>
> It wasn't intended to be mean but it is the truth.  We need to
> approach these sort of problems in a more systematic way.  Systematic
> means identifying what the fundamental problems are and fixing them in
> a proper way.

It is not a corner case, it is "always" that we have enough memory.
Basically our bitmap handling code is "exponential" on memory size, so
the bigger the amount of memory the bigger the problems appear.

> Throwing a magic number in the iteration path after which we let the
> main loop run for a little bit is simply not a solution.  You're still
> going to get main loop starvation.

This deppends of how you approach the problem.   We have io_handlers
that take too much time (for various definitios of "too much").

>> There are two returns and one heuristic.
>>
>> - return a) we try to migrate when we know that there is no space,
>>    obvious optimazation/bug (deppends on how to look at it).
>>
>> - return b) we don't need to handle TLB bitmap code for kvm.  I fully
>>    agree that we need to split the bitmaps in something more sensible,
>>    but change is quite invasible, and simple fix works for the while.
>>
>> - heuristic:  if you really think that an io_handler should be able to
>>    stall the main loop for almost 4 seconds, sorry, I don't agree.
>>    
>
> But fundamentally, why would an iteration of migration take 4 seconds?

That is a good question.  It shouldn't.

> This is really my fundamental object, the migration loop should, at
> most take as much time as it takes to fill up an empty socket buffer
> or until it hits the bandwidth limit.

bandwidth limit is not hit with zero pages.

> The bandwidth limit offers a fine-grain control over exactly how long
> it should take.

No if we are not sending data.

> If we're burning excess CPU walking a 100MB bitmap, then let's fix
> that problem.  Stopping every 1MB worth of the bitmap to do other work
> just papers over the real problem (that we're walking 100MB bitmap).

Agreed.

OK.  I am going to state this other way.  We have at least this
problems:
- qemu/kvm bitmap handler
- io_handler takes too much memory
- interface witch kvm
- bandwidth calculation
- zero page optimazations
- qemu_mutex handling.

If I sent a patch that fixes only "some" of those, I am going to be
asked for numbers (because problem would still be there).

Are you telling me that if I sent a patch that fixes any of the
individual problems, but for the user there is still the same problem
(due to the other ones) they are going to be accepted?

I tried to get the minimal set of patches that fixes the "real" problem:
i.e. stalls.  Half of them are obvious and you agreed to accept them.

But accepting them, we haven't fixed the problem.  So, what is the next
step?
- getting the problems fixed one by one: would that patches get
  integrated?  notice that stalls/numbers will not improve until last
  problem is fixed.
- nothing will get integrated until everything is fixed?

My idea was to go the other way around.  Get the minimal fixes that are
needed to fix the real problem, and then change the implementation of
the things that I showed before.  Why, because this way we know that
problem is fixed (not the more elegant solution, but it is fixed), and
we can measure that we are not getting problems back with each change.

Testing each change in isolation don't allow us to measure that we are
not re-introducing the problem.

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 15:40             ` Juan Quintela
@ 2010-11-30 16:10               ` Michael S. Tsirkin
  2010-11-30 16:32                 ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-30 16:10 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel

On Tue, Nov 30, 2010 at 04:40:41PM +0100, Juan Quintela wrote:
> Basically our bitmap handling code is "exponential" on memory size,

I didn't realize this. What makes it exponential?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 16:10               ` Michael S. Tsirkin
@ 2010-11-30 16:32                 ` Juan Quintela
  2010-11-30 16:44                   ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 16:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Nov 30, 2010 at 04:40:41PM +0100, Juan Quintela wrote:
>> Basically our bitmap handling code is "exponential" on memory size,
>
> I didn't realize this. What makes it exponential?

Well, 1st of all, it is "exponential" as you measure it.

stalls by default are:

1-2GB: milliseconds
2-4GB: 100-200ms
4-8GB: 1s
64GB: 59s
400GB: 24m (yes, minutes)

That sounds really exponential.

Now the other thing is the cache size.

with 64GB of RAM, we basically have a 16MB bitmap size, i.e. we blow the
cache each time that we run ram_save_live().

This is one of the reasons why I don't want to walk the bitmap on
ram_save_remaining().

ram_save_remaining() is linear with memory size (previous my changes).
ram_save_live is also linear on the memory size, so we are in a worse
case of n^n  (notice that this is the worst case, we normally do much
better n^2, n^3 or so).

Add to this that we are blowing the caches with big amounts of memory
(we don't do witch smaller sizes), and you can see that our behaviour is
clearly exponential.

As I need to fix them, I will work on them today/tomorrow.

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 16:32                 ` Juan Quintela
@ 2010-11-30 16:44                   ` Anthony Liguori
  2010-11-30 18:04                     ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 16:44 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Michael S. Tsirkin

On 11/30/2010 10:32 AM, Juan Quintela wrote:
> "Michael S. Tsirkin"<mst@redhat.com>  wrote:
>    
>> On Tue, Nov 30, 2010 at 04:40:41PM +0100, Juan Quintela wrote:
>>      
>>> Basically our bitmap handling code is "exponential" on memory size,
>>>        
>> I didn't realize this. What makes it exponential?
>>      
> Well, 1st of all, it is "exponential" as you measure it.
>
> stalls by default are:
>
> 1-2GB: milliseconds
> 2-4GB: 100-200ms
> 4-8GB: 1s
> 64GB: 59s
> 400GB: 24m (yes, minutes)
>
> That sounds really exponential.
>    

How are you measuring stalls btw?

Regards,

Anthony Liguori

> Now the other thing is the cache size.
>
> with 64GB of RAM, we basically have a 16MB bitmap size, i.e. we blow the
> cache each time that we run ram_save_live().
>
> This is one of the reasons why I don't want to walk the bitmap on
> ram_save_remaining().
>
> ram_save_remaining() is linear with memory size (previous my changes).
> ram_save_live is also linear on the memory size, so we are in a worse
> case of n^n  (notice that this is the worst case, we normally do much
> better n^2, n^3 or so).
>
> Add to this that we are blowing the caches with big amounts of memory
> (we don't do witch smaller sizes), and you can see that our behaviour is
> clearly exponential.
>
> As I need to fix them, I will work on them today/tomorrow.
>
> Later, Juan.
>    

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 16:44                   ` Anthony Liguori
@ 2010-11-30 18:04                     ` Juan Quintela
  2010-11-30 18:54                       ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 18:04 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Michael S. Tsirkin

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/30/2010 10:32 AM, Juan Quintela wrote:
>> "Michael S. Tsirkin"<mst@redhat.com>  wrote:
>>    
>>> On Tue, Nov 30, 2010 at 04:40:41PM +0100, Juan Quintela wrote:
>>>      
>>>> Basically our bitmap handling code is "exponential" on memory size,
>>>>        
>>> I didn't realize this. What makes it exponential?
>>>      
>> Well, 1st of all, it is "exponential" as you measure it.
>>
>> stalls by default are:
>>
>> 1-2GB: milliseconds
>> 2-4GB: 100-200ms
>> 4-8GB: 1s
>> 64GB: 59s
>> 400GB: 24m (yes, minutes)
>>
>> That sounds really exponential.
>>    
>
> How are you measuring stalls btw?

At the end of the ram_save_live().  This was the reason that I put the
information there.

for the 24mins stall (I don't have that machine anymore) I had less
"exact" measurements.  It was the amount that it "decided" to sent in
the last non-live part of memory migration.  With the stalls & zero page
account, we just got to the point where we had basically infinity speed.

Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> Now the other thing is the cache size.
>>
>> with 64GB of RAM, we basically have a 16MB bitmap size, i.e. we blow the
>> cache each time that we run ram_save_live().
>>
>> This is one of the reasons why I don't want to walk the bitmap on
>> ram_save_remaining().
>>
>> ram_save_remaining() is linear with memory size (previous my changes).
>> ram_save_live is also linear on the memory size, so we are in a worse
>> case of n^n  (notice that this is the worst case, we normally do much
>> better n^2, n^3 or so).
>>
>> Add to this that we are blowing the caches with big amounts of memory
>> (we don't do witch smaller sizes), and you can see that our behaviour is
>> clearly exponential.
>>
>> As I need to fix them, I will work on them today/tomorrow.
>>
>> Later, Juan.
>>    

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 18:04                     ` Juan Quintela
@ 2010-11-30 18:54                       ` Anthony Liguori
  2010-11-30 19:15                         ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 18:54 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Michael S. Tsirkin

On 11/30/2010 12:04 PM, Juan Quintela wrote:
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 11/30/2010 10:32 AM, Juan Quintela wrote:
>>      
>>> "Michael S. Tsirkin"<mst@redhat.com>   wrote:
>>>
>>>        
>>>> On Tue, Nov 30, 2010 at 04:40:41PM +0100, Juan Quintela wrote:
>>>>
>>>>          
>>>>> Basically our bitmap handling code is "exponential" on memory size,
>>>>>
>>>>>            
>>>> I didn't realize this. What makes it exponential?
>>>>
>>>>          
>>> Well, 1st of all, it is "exponential" as you measure it.
>>>
>>> stalls by default are:
>>>
>>> 1-2GB: milliseconds
>>> 2-4GB: 100-200ms
>>> 4-8GB: 1s
>>> 64GB: 59s
>>> 400GB: 24m (yes, minutes)
>>>
>>> That sounds really exponential.
>>>
>>>        
>> How are you measuring stalls btw?
>>      
> At the end of the ram_save_live().  This was the reason that I put the
> information there.
>
> for the 24mins stall (I don't have that machine anymore) I had less
> "exact" measurements.  It was the amount that it "decided" to sent in
> the last non-live part of memory migration.  With the stalls&  zero page
> account, we just got to the point where we had basically infinity speed.
>    

That's not quite guest visible.

It only is a "stall" if the guest is trying to access device emulation 
and acquiring the qemu_mutex.  A more accurate measurement would be 
something that measured guest availability.  For instance, I tight loop 
of while (1) { usleep(100); gettimeofday(); } that then recorded periods 
of unavailability > X.

Of course, it's critically important that a working version of pvclock 
be available int he guest for this to be accurate.

Regards,

Anthony Liguori

> Later, Juan.
>
>    
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>> Now the other thing is the cache size.
>>>
>>> with 64GB of RAM, we basically have a 16MB bitmap size, i.e. we blow the
>>> cache each time that we run ram_save_live().
>>>
>>> This is one of the reasons why I don't want to walk the bitmap on
>>> ram_save_remaining().
>>>
>>> ram_save_remaining() is linear with memory size (previous my changes).
>>> ram_save_live is also linear on the memory size, so we are in a worse
>>> case of n^n  (notice that this is the worst case, we normally do much
>>> better n^2, n^3 or so).
>>>
>>> Add to this that we are blowing the caches with big amounts of memory
>>> (we don't do witch smaller sizes), and you can see that our behaviour is
>>> clearly exponential.
>>>
>>> As I need to fix them, I will work on them today/tomorrow.
>>>
>>> Later, Juan.
>>>
>>>        

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 18:54                       ` Anthony Liguori
@ 2010-11-30 19:15                         ` Juan Quintela
  2010-11-30 20:23                           ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 19:15 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Michael S. Tsirkin

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/30/2010 12:04 PM, Juan Quintela wrote:
>> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>>    
>>> On 11/30/2010 10:32 AM, Juan Quintela wrote:
>>>      
>>>> "Michael S. Tsirkin"<mst@redhat.com>   wrote:
>>>>
>>>>        
>>>>> On Tue, Nov 30, 2010 at 04:40:41PM +0100, Juan Quintela wrote:
>>>>>
>>>>>          
>>>>>> Basically our bitmap handling code is "exponential" on memory size,
>>>>>>
>>>>>>            
>>>>> I didn't realize this. What makes it exponential?
>>>>>
>>>>>          
>>>> Well, 1st of all, it is "exponential" as you measure it.
>>>>
>>>> stalls by default are:
>>>>
>>>> 1-2GB: milliseconds
>>>> 2-4GB: 100-200ms
>>>> 4-8GB: 1s
>>>> 64GB: 59s
>>>> 400GB: 24m (yes, minutes)
>>>>
>>>> That sounds really exponential.
>>>>
>>>>        
>>> How are you measuring stalls btw?
>>>      
>> At the end of the ram_save_live().  This was the reason that I put the
>> information there.
>>
>> for the 24mins stall (I don't have that machine anymore) I had less
>> "exact" measurements.  It was the amount that it "decided" to sent in
>> the last non-live part of memory migration.  With the stalls&  zero page
>> account, we just got to the point where we had basically infinity speed.
>>    
>
> That's not quite guest visible.

Humm, guest don't answer in 24mins
monitor don't answer in 24mins
ping don't answer in 24mins

are you sure that this is not visible?  Bug report put that guest had
just died, it was me who waited to see that it took 24mins to end.

> It only is a "stall" if the guest is trying to access device emulation
> and acquiring the qemu_mutex.  A more accurate measurement would be
> something that measured guest availability.  For instance, I tight
> loop of while (1) { usleep(100); gettimeofday(); } that then recorded
> periods of unavailability > X.

This is better, and this is what qemu_mutex change should fix.

> Of course, it's critically important that a working version of pvclock
> be available int he guest for this to be accurate.

If the problem are 24mins, we don't need such an "exact" version O:-)

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 19:15                         ` Juan Quintela
@ 2010-11-30 20:23                           ` Anthony Liguori
  2010-11-30 20:56                             ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 20:23 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Michael S. Tsirkin

On 11/30/2010 01:15 PM, Juan Quintela wrote:
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 11/30/2010 12:04 PM, Juan Quintela wrote:
>>      
>>> Anthony Liguori<anthony@codemonkey.ws>   wrote:
>>>
>>>        
>>>> On 11/30/2010 10:32 AM, Juan Quintela wrote:
>>>>
>>>>          
>>>>> "Michael S. Tsirkin"<mst@redhat.com>    wrote:
>>>>>
>>>>>
>>>>>            
>>>>>> On Tue, Nov 30, 2010 at 04:40:41PM +0100, Juan Quintela wrote:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> Basically our bitmap handling code is "exponential" on memory size,
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>> I didn't realize this. What makes it exponential?
>>>>>>
>>>>>>
>>>>>>              
>>>>> Well, 1st of all, it is "exponential" as you measure it.
>>>>>
>>>>> stalls by default are:
>>>>>
>>>>> 1-2GB: milliseconds
>>>>> 2-4GB: 100-200ms
>>>>> 4-8GB: 1s
>>>>> 64GB: 59s
>>>>> 400GB: 24m (yes, minutes)
>>>>>
>>>>> That sounds really exponential.
>>>>>
>>>>>
>>>>>            
>>>> How are you measuring stalls btw?
>>>>
>>>>          
>>> At the end of the ram_save_live().  This was the reason that I put the
>>> information there.
>>>
>>> for the 24mins stall (I don't have that machine anymore) I had less
>>> "exact" measurements.  It was the amount that it "decided" to sent in
>>> the last non-live part of memory migration.  With the stalls&   zero page
>>> account, we just got to the point where we had basically infinity speed.
>>>
>>>        
>> That's not quite guest visible.
>>      
> Humm, guest don't answer in 24mins
> monitor don't answer in 24mins
> ping don't answer in 24mins
>
> are you sure that this is not visible?  Bug report put that guest had
> just died, it was me who waited to see that it took 24mins to end.
>    

I'm extremely sceptical that any of your patches would address this 
problem.  Even if you had to scan every page in a 400GB guest, it would 
not take 24 minutes.   Something is not quite right here.

24 minutes suggests that there's another problem that is yet to be 
identified.

Regards,

Anthony Liguori

>> It only is a "stall" if the guest is trying to access device emulation
>> and acquiring the qemu_mutex.  A more accurate measurement would be
>> something that measured guest availability.  For instance, I tight
>> loop of while (1) { usleep(100); gettimeofday(); } that then recorded
>> periods of unavailability>  X.
>>      
> This is better, and this is what qemu_mutex change should fix.
>
>    
>> Of course, it's critically important that a working version of pvclock
>> be available int he guest for this to be accurate.
>>      
> If the problem are 24mins, we don't need such an "exact" version O:-)
>
> Later, Juan.
>    

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
  2010-11-30 20:23                           ` Anthony Liguori
@ 2010-11-30 20:56                             ` Juan Quintela
  0 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 20:56 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Michael S. Tsirkin

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/30/2010 01:15 PM, Juan Quintela wrote:

>>>> At the end of the ram_save_live().  This was the reason that I put the
>>>> information there.
>>>>
>>>> for the 24mins stall (I don't have that machine anymore) I had less
>>>> "exact" measurements.  It was the amount that it "decided" to sent in
>>>> the last non-live part of memory migration.  With the stalls&   zero page
>>>> account, we just got to the point where we had basically infinity speed.
>>>>
>>>>        
>>> That's not quite guest visible.
>>>      
>> Humm, guest don't answer in 24mins
>> monitor don't answer in 24mins
>> ping don't answer in 24mins
>>
>> are you sure that this is not visible?  Bug report put that guest had
>> just died, it was me who waited to see that it took 24mins to end.
>>    
>
> I'm extremely sceptical that any of your patches would address this
> problem.  Even if you had to scan every page in a 400GB guest, it
> would not take 24 minutes.   Something is not quite right here.
>
> 24 minutes suggests that there's another problem that is yet to be
> identified.

I haven't tested the lastest versions of the patches sent to the list,
but a previous version fixed that problem.

If I get my hands on the machine will try to reproduce the problem and
measure things.

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 03/10] Add printf debug to savevm
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
  2010-11-23 23:02 ` [Qemu-devel] [PATCH 01/10] Add spent time to migration Juan Quintela
  2010-11-23 23:02 ` [Qemu-devel] [PATCH 02/10] Add buffered_file_internal constant Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
       [not found]   ` <4CF45AB2.7050506@codemonkey.ws>
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 04/10] No need to iterate if we already are over the limit Juan Quintela
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

Once there, print all sections that take more than 100ms to migrate.
buffered file runs a timer at that 100ms interval.

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 savevm.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/savevm.c b/savevm.c
index 4e49765..ceed6de 100644
--- a/savevm.c
+++ b/savevm.c
@@ -83,6 +83,24 @@
 #include "migration.h"
 #include "qemu_socket.h"
 #include "qemu-queue.h"
+#include "buffered_file.h"
+
+//#define DEBUG_SAVEVM
+
+#ifdef DEBUG_SAVEVM
+#define DPRINTF(fmt, ...) \
+    do { printf("savevm: " fmt, ## __VA_ARGS__); } while (0)
+static int64_t start, stop;
+#define START_SAVEVM_CLOCK()	do { start = qemu_get_clock(rt_clock); } while (0)
+#define STOP_SAVEVM_CLOCK() \
+	do { stop = qemu_get_clock(rt_clock) - start; \
+	} while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#define START_SAVEVM_CLOCK()  do { } while (0)
+#define STOP_SAVEVM_CLOCK()   do { } while (0)
+#endif

 #define SELF_ANNOUNCE_ROUNDS 5

@@ -1480,11 +1498,23 @@ int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
         if (se->save_live_state == NULL)
             continue;

+        START_SAVEVM_CLOCK();
+
         /* Section type */
         qemu_put_byte(f, QEMU_VM_SECTION_PART);
         qemu_put_be32(f, se->section_id);

         ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+        STOP_SAVEVM_CLOCK();
+#ifdef DEBUG_SAVEVM
+	if (stop > buffered_file_interval) {
+		/* buffered_file run a timer at 100ms */
+		static int times_missing = 1;
+		DPRINTF("save live iterate section id %u name %s took %ld milliseconds %u times\n",
+			se->section_id,	se->idstr, stop, times_missing++);
+	}
+#endif
+
         if (!ret) {
             /* Do not proceed to the next vmstate before this one reported
                completion of the current stage. This serializes the migration
@@ -1516,13 +1546,18 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
         if (se->save_live_state == NULL)
             continue;

+        START_SAVEVM_CLOCK();
         /* Section type */
         qemu_put_byte(f, QEMU_VM_SECTION_END);
         qemu_put_be32(f, se->section_id);

         se->save_live_state(mon, f, QEMU_VM_SECTION_END, se->opaque);
+        STOP_SAVEVM_CLOCK();
+        DPRINTF("save live end section id %u name %s took %ld milliseconds\n",
+		se->section_id, se->idstr, stop);
     }

+    START_SAVEVM_CLOCK();
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         int len;

@@ -1542,12 +1577,14 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
         qemu_put_be32(f, se->version_id);

         r = vmstate_save(f, se);
+        DPRINTF("save section id %u name %s\n", se->section_id, se->idstr);
         if (r < 0) {
             monitor_printf(mon, "cannot migrate with device '%s'\n", se->idstr);
             return r;
         }
     }
-
+    STOP_SAVEVM_CLOCK();
+    DPRINTF("save devices took %ld milliseconds\n", stop);
     qemu_put_byte(f, QEMU_VM_EOF);

     if (qemu_file_has_error(f))
@@ -1746,8 +1783,11 @@ int qemu_loadvm_state(QEMUFile *f)
             le->section_id = section_id;
             le->version_id = version_id;
             QLIST_INSERT_HEAD(&loadvm_handlers, le, entry);
-
+            START_SAVEVM_CLOCK();
             ret = vmstate_load(f, le->se, le->version_id);
+            STOP_SAVEVM_CLOCK();
+            DPRINTF("load section id %u name %s took %ld milliseconds\n", le->section_id,
+		    le->se->idstr, stop);
             if (ret < 0) {
                 fprintf(stderr, "qemu: warning: error while loading state for instance 0x%x of device '%s'\n",
                         instance_id, idstr);
@@ -1769,7 +1809,11 @@ int qemu_loadvm_state(QEMUFile *f)
                 goto out;
             }

+            START_SAVEVM_CLOCK();
             ret = vmstate_load(f, le->se, le->version_id);
+            STOP_SAVEVM_CLOCK();
+            DPRINTF("load section id %u name %s took %ld milliseconds\n", le->section_id,
+		    le->se->idstr, stop);
             if (ret < 0) {
                 fprintf(stderr, "qemu: warning: error while loading state section id %d\n",
                         section_id);
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

[parent not found: <4CF45AB2.7050506@codemonkey.ws>]

* Re: [Qemu-devel] [PATCH 03/10] Add printf debug to savevm
       [not found]   ` <4CF45AB2.7050506@codemonkey.ws>
@ 2010-11-30 10:36     ` Stefan Hajnoczi
  2010-11-30 22:40       ` [Qemu-devel] " Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Stefan Hajnoczi @ 2010-11-30 10:36 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Anthony Liguori, qemu-devel, Juan Quintela

On Mon, Nov 29, 2010 at 08:00:18PM -0600, Anthony Liguori wrote:
> Yeah, all of this should be done via tracing.  Maybe Stefan can make
> some suggestions.

Here is an example of how to record savevm timestamps using tracing.
Actually the timestamp is recorded automatically when a trace event
fires.

Add these to the trace-events file:
savevm_start(void) ""
savevm_stop(unsigned int section_id) "section_id %u"

Then use trace_savevm_start() instead of START_SAVEVM_CLOCK() and
trace_savevm_stop() instead of STOP_SAVEVM_CLOCK() in savevm.c.  Also
#include "trace.h".

All the macros and inline timestamp analysis can be removed from
savevm.c.

./configure --trace-backend=simple [...]
make

After running savevm look for the trace-<pid> file that QEMU produces in
its current working directory.  You can pretty-print it like this:
./simpletrace.py trace-events trace-<pid>

The second field in the simpletrace.py output is the time delta (in
microseconds) since the last trace event.  So you can easily see how
long start->stop took.

For more info see docs/tracing.txt.  You might prefer to use SystemTap
(./configure --trace-backend=dtrace) so you can write stap scripts to do
more powerful analysis.

Stefan

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 03/10] Add printf debug to savevm
  2010-11-30 10:36     ` Stefan Hajnoczi
@ 2010-11-30 22:40       ` Juan Quintela
  2010-12-01  7:50         ` Stefan Hajnoczi
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 22:40 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Anthony Liguori, qemu-devel

Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> wrote:
> On Mon, Nov 29, 2010 at 08:00:18PM -0600, Anthony Liguori wrote:
>> Yeah, all of this should be done via tracing.  Maybe Stefan can make
>> some suggestions.
>
> Here is an example of how to record savevm timestamps using tracing.
> Actually the timestamp is recorded automatically when a trace event
> fires.
>
> Add these to the trace-events file:
> savevm_start(void) ""
> savevm_stop(unsigned int section_id) "section_id %u"
>
> Then use trace_savevm_start() instead of START_SAVEVM_CLOCK() and
> trace_savevm_stop() instead of STOP_SAVEVM_CLOCK() in savevm.c.  Also
> #include "trace.h".
>
> All the macros and inline timestamp analysis can be removed from
> savevm.c.
>
> ./configure --trace-backend=simple [...]
> make
>
> After running savevm look for the trace-<pid> file that QEMU produces in
> its current working directory.  You can pretty-print it like this:
> ./simpletrace.py trace-events trace-<pid>
>
> The second field in the simpletrace.py output is the time delta (in
> microseconds) since the last trace event.  So you can easily see how
> long start->stop took.
>
> For more info see docs/tracing.txt.  You might prefer to use SystemTap
> (./configure --trace-backend=dtrace) so you can write stap scripts to do
> more powerful analysis.

Thanks.

Searching for guru, I basically want to only print the values when the
difference is bigger that some values (number of calls is really big, I
need to learn how to script the analisys).

But clearly this "have to be" easier than change printf's and ifs and
rerun the test.

Later, Juan.

> Stefan

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 03/10] Add printf debug to savevm
  2010-11-30 22:40       ` [Qemu-devel] " Juan Quintela
@ 2010-12-01  7:50         ` Stefan Hajnoczi
  0 siblings, 0 replies; 75+ messages in thread
From: Stefan Hajnoczi @ 2010-12-01  7:50 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Anthony Liguori, qemu-devel

On Tue, Nov 30, 2010 at 11:40:36PM +0100, Juan Quintela wrote:
> Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> wrote:
> > On Mon, Nov 29, 2010 at 08:00:18PM -0600, Anthony Liguori wrote:
> >> Yeah, all of this should be done via tracing.  Maybe Stefan can make
> >> some suggestions.
> >
> > Here is an example of how to record savevm timestamps using tracing.
> > Actually the timestamp is recorded automatically when a trace event
> > fires.
> >
> > Add these to the trace-events file:
> > savevm_start(void) ""
> > savevm_stop(unsigned int section_id) "section_id %u"
> >
> > Then use trace_savevm_start() instead of START_SAVEVM_CLOCK() and
> > trace_savevm_stop() instead of STOP_SAVEVM_CLOCK() in savevm.c.  Also
> > #include "trace.h".
> >
> > All the macros and inline timestamp analysis can be removed from
> > savevm.c.
> >
> > ./configure --trace-backend=simple [...]
> > make
> >
> > After running savevm look for the trace-<pid> file that QEMU produces in
> > its current working directory.  You can pretty-print it like this:
> > ./simpletrace.py trace-events trace-<pid>
> >
> > The second field in the simpletrace.py output is the time delta (in
> > microseconds) since the last trace event.  So you can easily see how
> > long start->stop took.
> >
> > For more info see docs/tracing.txt.  You might prefer to use SystemTap
> > (./configure --trace-backend=dtrace) so you can write stap scripts to do
> > more powerful analysis.
> 
> Thanks.
> 
> Searching for guru, I basically want to only print the values when the
> difference is bigger that some values (number of calls is really big, I
> need to learn how to script the analisys).

You can use awk(1) on the simpletrace.py output:
$1 ~ /savevm_stop/ {
	/* Print savevm_stop line when >100 ms duration */
	if ($2 > 100000) {
		printf("%s times_missing=%u\n", $0, times_missing++);
	}
}

Or you can use --trace-backend=dtrace and write SystemTap scripts:
http://sourceware.org/systemtap/SystemTap_Beginners_Guide/
http://sourceware.org/systemtap/langref/

Stefan

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 04/10] No need to iterate if we already are over the limit
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
                   ` (2 preceding siblings ...)
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 03/10] Add printf debug to savevm Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 05/10] KVM don't care about TLB handling Juan Quintela
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

If buffers are full, don't iterate

Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@trasno.org>
---
 savevm.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/savevm.c b/savevm.c
index ceed6de..7c289af 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1498,6 +1498,9 @@ int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
         if (se->save_live_state == NULL)
             continue;

+        if (qemu_file_rate_limit(f))
+            return 0;
+
         START_SAVEVM_CLOCK();

         /* Section type */
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 05/10] KVM don't care about TLB handling
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
                   ` (3 preceding siblings ...)
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 04/10] No need to iterate if we already are over the limit Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 06/10] Only calculate expected_time for stage 2 Juan Quintela
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

TLB handling is only used in TCG mode.  It is very costly for guests with lots
of memory ad lots of CPU's.

Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@trasno.org>
---
 exec.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index db9ff55..f5b2386 100644
--- a/exec.c
+++ b/exec.c
@@ -2028,6 +2028,9 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
         return;
     cpu_physical_memory_mask_dirty_range(start, length, dirty_flags);

+    if (kvm_enabled())
+        return;
+
     /* we modify the TLB cache so that the dirty bit will be set again
        when accessing the range */
     start1 = (unsigned long)qemu_get_ram_ptr(start);
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 06/10] Only calculate expected_time for stage 2
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
                   ` (4 preceding siblings ...)
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 05/10] KVM don't care about TLB handling Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 07/10] ram_save_remaining() returns an uint64_t Juan Quintela
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 arch_init.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 4486925..df3d91f 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -217,7 +217,6 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     ram_addr_t addr;
     uint64_t bytes_transferred_last;
     double bwidth = 0;
-    uint64_t expected_time = 0;

     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
@@ -293,9 +292,12 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)

     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);

-    expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
-
-    return (stage == 2) && (expected_time <= migrate_max_downtime());
+    if (stage == 2) {
+	    uint64_t expected_time;
+	    expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
+	    return expected_time <= migrate_max_downtime();
+    }
+    return 0;
 }

 static inline void *host_from_stream_offset(QEMUFile *f,
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 07/10] ram_save_remaining() returns an uint64_t
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
                   ` (5 preceding siblings ...)
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 06/10] Only calculate expected_time for stage 2 Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
       [not found]   ` <4CF45C0C.705@codemonkey.ws>
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 08/10] Count nanoseconds with uint64_t not doubles Juan Quintela
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

It returns a counter of things, not a ram address.

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 arch_init.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index df3d91f..9e941a0 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -173,10 +173,10 @@ static int ram_save_block(QEMUFile *f)

 static uint64_t bytes_transferred;

-static ram_addr_t ram_save_remaining(void)
+static uint64_t ram_save_remaining(void)
 {
     RAMBlock *block;
-    ram_addr_t count = 0;
+    uint64_t count = 0;

     QLIST_FOREACH(block, &ram_list.blocks, next) {
         ram_addr_t addr;
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

[parent not found: <4CF45C0C.705@codemonkey.ws>]

* [Qemu-devel] Re: [PATCH 07/10] ram_save_remaining() returns an uint64_t
       [not found]   ` <4CF45C0C.705@codemonkey.ws>
@ 2010-11-30  7:21     ` Paolo Bonzini
  2010-11-30 13:44       ` Anthony Liguori
  2010-11-30 14:38     ` Juan Quintela
  1 sibling, 1 reply; 75+ messages in thread
From: Paolo Bonzini @ 2010-11-30  7:21 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Juan Quintela, qemu-devel, Juan Quintela

On 11/30/2010 03:06 AM, Anthony Liguori wrote:
>>
>> -static ram_addr_t ram_save_remaining(void)
>> +static uint64_t ram_save_remaining(void)
>>   {
>>       RAMBlock *block;
>> -    ram_addr_t count = 0;
>> +    uint64_t count = 0;
>>
>>       QLIST_FOREACH(block,&ram_list.blocks, next) {
>>           ram_addr_t addr;
>
> No, it returns a count of bytes of ram which is a subset of ram_addr_t's
> space.  The unit is definitely right here.

It returns a count of pages, actually, which is a different unit.  For 
example, in practice it can fit in 32 bits even (though I'm not saying 
we should make it uint32_t; we'd be dangerously close to the limit for 
the guest sizes that Juan is handling).

Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 07/10] ram_save_remaining() returns an uint64_t
  2010-11-30  7:21     ` [Qemu-devel] " Paolo Bonzini
@ 2010-11-30 13:44       ` Anthony Liguori
  0 siblings, 0 replies; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 13:44 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Juan Quintela, qemu-devel, Juan Quintela

On 11/30/2010 01:21 AM, Paolo Bonzini wrote:
> On 11/30/2010 03:06 AM, Anthony Liguori wrote:
>>>
>>> -static ram_addr_t ram_save_remaining(void)
>>> +static uint64_t ram_save_remaining(void)
>>>   {
>>>       RAMBlock *block;
>>> -    ram_addr_t count = 0;
>>> +    uint64_t count = 0;
>>>
>>>       QLIST_FOREACH(block,&ram_list.blocks, next) {
>>>           ram_addr_t addr;
>>
>> No, it returns a count of bytes of ram which is a subset of ram_addr_t's
>> space.  The unit is definitely right here.
>
> It returns a count of pages, actually, which is a different unit.

Page size is just a fixed number in the ram_addr_t space.

Regards,

Anthony Liguori

>   For example, in practice it can fit in 32 bits even (though I'm not 
> saying we should make it uint32_t; we'd be dangerously close to the 
> limit for the guest sizes that Juan is handling).
>
> Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 07/10] ram_save_remaining() returns an uint64_t
       [not found]   ` <4CF45C0C.705@codemonkey.ws>
  2010-11-30  7:21     ` [Qemu-devel] " Paolo Bonzini
@ 2010-11-30 14:38     ` Juan Quintela
  1 sibling, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 14:38 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/23/2010 05:03 PM, Juan Quintela wrote:
>> From: Juan Quintela<quintela@trasno.org>
>>
>> It returns a counter of things, not a ram address.
>>
>> Signed-off-by: Juan Quintela<quintela@trasno.org>
>> Signed-off-by: Juan Quintela<quintela@redhat.com>
>> ---
>>   arch_init.c |    4 ++--
>>   1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index df3d91f..9e941a0 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -173,10 +173,10 @@ static int ram_save_block(QEMUFile *f)
>>
>>   static uint64_t bytes_transferred;
>>
>> -static ram_addr_t ram_save_remaining(void)
>> +static uint64_t ram_save_remaining(void)
>>   {
>>       RAMBlock *block;
>> -    ram_addr_t count = 0;
>> +    uint64_t count = 0;
>>
>>       QLIST_FOREACH(block,&ram_list.blocks, next) {
>>           ram_addr_t addr;
>>    
>
> No, it returns a count of bytes of ram which is a subset of
> ram_addr_t's space.  The unit is definitely right here.

I thought this would be un-controversial.  But the important part from
your sentence is "count".  ram_addr_t is an addr in my mind.

  /* address in the RAM (different from a physical address) */
  typedef unsigned long ram_addr_t;

in cpu-common.h comment, it is also an address.  But doing more grepping
I see that it is "conveniently" used as a size in other places.

Later, Juan.

> Regards,
>
> Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 08/10] Count nanoseconds with uint64_t not doubles
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
                   ` (6 preceding siblings ...)
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 07/10] ram_save_remaining() returns an uint64_t Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
  2010-11-30  7:17   ` [Qemu-devel] " Paolo Bonzini
       [not found]   ` <4CF45C5B.9080507@codemonkey.ws>
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 09/10] Exit loop if we have been there too long Juan Quintela
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 10/10] Maintaing number of dirty pages Juan Quintela
  9 siblings, 2 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 arch_init.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 9e941a0..d32aaae 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -216,6 +216,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
 {
     ram_addr_t addr;
     uint64_t bytes_transferred_last;
+    uint64_t t0;
     double bwidth = 0;

     if (stage < 0) {
@@ -258,7 +259,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     }

     bytes_transferred_last = bytes_transferred;
-    bwidth = qemu_get_clock_ns(rt_clock);
+    t0 = qemu_get_clock_ns(rt_clock);

     while (!qemu_file_rate_limit(f)) {
         int bytes_sent;
@@ -270,8 +271,8 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         }
     }

-    bwidth = qemu_get_clock_ns(rt_clock) - bwidth;
-    bwidth = (bytes_transferred - bytes_transferred_last) / bwidth;
+    t0 = qemu_get_clock_ns(rt_clock) - t0;
+    bwidth = (bytes_transferred - bytes_transferred_last) / t0;

     /* if we haven't transferred anything this round, force expected_time to a
      * a very high value, but without crashing */
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 08/10] Count nanoseconds with uint64_t not doubles
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 08/10] Count nanoseconds with uint64_t not doubles Juan Quintela
@ 2010-11-30  7:17   ` Paolo Bonzini
       [not found]   ` <4CF45C5B.9080507@codemonkey.ws>
  1 sibling, 0 replies; 75+ messages in thread
From: Paolo Bonzini @ 2010-11-30  7:17 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Juan Quintela

On 11/24/2010 12:03 AM, Juan Quintela wrote:
> -    bwidth = qemu_get_clock_ns(rt_clock) - bwidth;
> -    bwidth = (bytes_transferred - bytes_transferred_last) / bwidth;
> +    t0 = qemu_get_clock_ns(rt_clock) - t0;
> +    bwidth = (bytes_transferred - bytes_transferred_last) / t0;

The result here is integer too.  Is this desired?

Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

[parent not found: <4CF45C5B.9080507@codemonkey.ws>]

* [Qemu-devel] Re: [PATCH 08/10] Count nanoseconds with uint64_t not doubles
       [not found]   ` <4CF45C5B.9080507@codemonkey.ws>
@ 2010-11-30 14:40     ` Juan Quintela
  0 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 14:40 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/23/2010 05:03 PM, Juan Quintela wrote:
>> From: Juan Quintela<quintela@trasno.org>
>>
>> Signed-off-by: Juan Quintela<quintela@trasno.org>
>> Signed-off-by: Juan Quintela<quintela@redhat.com>
>>    
>
> Why?

I needed the precision for the timestamp.  Nanoseconds is an "integer"
property, not a float value.  And ferther more, until last line, we are
using a variable named "bwidth" to measure times.

Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> ---
>>   arch_init.c |    7 ++++---
>>   1 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 9e941a0..d32aaae 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -216,6 +216,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>   {
>>       ram_addr_t addr;
>>       uint64_t bytes_transferred_last;
>> +    uint64_t t0;
>>       double bwidth = 0;
>>
>>       if (stage<  0) {
>> @@ -258,7 +259,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>       }
>>
>>       bytes_transferred_last = bytes_transferred;
>> -    bwidth = qemu_get_clock_ns(rt_clock);
>> +    t0 = qemu_get_clock_ns(rt_clock);
>>
>>       while (!qemu_file_rate_limit(f)) {
>>           int bytes_sent;
>> @@ -270,8 +271,8 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>           }
>>       }
>>
>> -    bwidth = qemu_get_clock_ns(rt_clock) - bwidth;
>> -    bwidth = (bytes_transferred - bytes_transferred_last) / bwidth;
>> +    t0 = qemu_get_clock_ns(rt_clock) - t0;
>> +    bwidth = (bytes_transferred - bytes_transferred_last) / t0;
>>
>>       /* if we haven't transferred anything this round, force expected_time to a
>>        * a very high value, but without crashing */
>>    

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 09/10] Exit loop if we have been there too long
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
                   ` (7 preceding siblings ...)
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 08/10] Count nanoseconds with uint64_t not doubles Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
  2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
       [not found]   ` <4CF45D67.5010906@codemonkey.ws>
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 10/10] Maintaing number of dirty pages Juan Quintela
  9 siblings, 2 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

cheking each 64 pages is a random magic number as good as any other.
We don't want to test too many times, but on the other hand,
qemu_get_clock_ns() is not so expensive either.

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 arch_init.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index d32aaae..b463798 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -40,6 +40,7 @@
 #include "net.h"
 #include "gdbstub.h"
 #include "hw/smbios.h"
+#include "buffered_file.h"

 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -218,6 +219,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     uint64_t bytes_transferred_last;
     uint64_t t0;
     double bwidth = 0;
+    int i;

     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
@@ -261,6 +263,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     bytes_transferred_last = bytes_transferred;
     t0 = qemu_get_clock_ns(rt_clock);

+    i = 0;
     while (!qemu_file_rate_limit(f)) {
         int bytes_sent;

@@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         if (bytes_sent == 0) { /* no more blocks */
             break;
         }
+	/* we want to check in the 1st loop, just in case it was the 1st time
+           and we had to sync the dirty bitmap.
+           qemu_get_clock_ns() is a bit expensive, so we only check each some
+           iterations
+	*/
+        if ((i & 63) == 0) {
+            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
+            if (t1 > buffered_file_interval/2) {
+                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
+		break;
+	    }
+	}
+        i++;
     }

     t0 = qemu_get_clock_ns(rt_clock) - t0;
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 09/10] Exit loop if we have been there too long Juan Quintela
@ 2010-11-24 10:40   ` Michael S. Tsirkin
  2010-11-24 11:01     ` Juan Quintela
       [not found]   ` <4CF45D67.5010906@codemonkey.ws>
  1 sibling, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-24 10:40 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel, Juan Quintela

On Wed, Nov 24, 2010 at 12:03:06AM +0100, Juan Quintela wrote:
> From: Juan Quintela <quintela@trasno.org>
> 
> cheking each 64 pages is a random magic number as good as any other.
> We don't want to test too many times, but on the other hand,
> qemu_get_clock_ns() is not so expensive either.
> 

Could you please explain what's the problem this fixes?
I would like to see an API that documents the contract
we are making with the backend.

> Signed-off-by: Juan Quintela <quintela@trasno.org>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  arch_init.c |   16 ++++++++++++++++
>  1 files changed, 16 insertions(+), 0 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index d32aaae..b463798 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -40,6 +40,7 @@
>  #include "net.h"
>  #include "gdbstub.h"
>  #include "hw/smbios.h"
> +#include "buffered_file.h"
> 
>  #ifdef TARGET_SPARC
>  int graphic_width = 1024;
> @@ -218,6 +219,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      uint64_t bytes_transferred_last;
>      uint64_t t0;
>      double bwidth = 0;
> +    int i;
> 
>      if (stage < 0) {
>          cpu_physical_memory_set_dirty_tracking(0);
> @@ -261,6 +263,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>      bytes_transferred_last = bytes_transferred;
>      t0 = qemu_get_clock_ns(rt_clock);
> 
> +    i = 0;
>      while (!qemu_file_rate_limit(f)) {
>          int bytes_sent;
> 
> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>          if (bytes_sent == 0) { /* no more blocks */
>              break;
>          }
> +	/* we want to check in the 1st loop, just in case it was the 1st time
> +           and we had to sync the dirty bitmap.
> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
> +           iterations
> +	*/
> +        if ((i & 63) == 0) {
> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;

This adds even more non-determinism to savevm behaviour.  If bandwidth
limit is higth enough, I expect it to just keep going.

> +            if (t1 > buffered_file_interval/2) {

arch_init should not depend on buffered_file implementation IMO.

Also - / 2?

> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);

Is this a debugging aid?

> +		break;
> +	    }
> +	}
> +        i++;
>      }
> 
>      t0 = qemu_get_clock_ns(rt_clock) - t0;
> -- 
> 1.7.3.2
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
@ 2010-11-24 11:01     ` Juan Quintela
  2010-11-24 11:14       ` Michael S. Tsirkin
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-24 11:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Nov 24, 2010 at 12:03:06AM +0100, Juan Quintela wrote:
>> From: Juan Quintela <quintela@trasno.org>
>> 
>> cheking each 64 pages is a random magic number as good as any other.
>> We don't want to test too many times, but on the other hand,
>> qemu_get_clock_ns() is not so expensive either.
>> 
>
> Could you please explain what's the problem this fixes?
> I would like to see an API that documents the contract
> we are making with the backend.

buffered_file is an "abstraction" that uses a buffer.

live migration code (remember it can't sleep, it runs on the main loop)
stores its "stuff" on that buffer.  And a timer writes that buffer to
the fd that is associated with migration.

This design is due to the main_loop/no threads qemu model.

buffered_file timer runs each 100ms.  And we "try" to measure channel
bandwidth from there.  If we are not able to run the timer, all the
calculations are wrong, and then stalls happens.

>> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>>          if (bytes_sent == 0) { /* no more blocks */
>>              break;
>>          }
>> +	/* we want to check in the 1st loop, just in case it was the 1st time
>> +           and we had to sync the dirty bitmap.
>> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
>> +           iterations
>> +	*/
>> +        if ((i & 63) == 0) {
>> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
>
> This adds even more non-determinism to savevm behaviour.  If bandwidth
> limit is higth enough, I expect it to just keep going.

If we find a row of 512MB of zero pages together (and that happens if
you have a 64GB iddle guest, then you can spent more than 3seconds to
fill the default bandwith).  After that everything that uses the main
loop has had stalls.

>> +            if (t1 > buffered_file_interval/2) {
>
> arch_init should not depend on buffered_file implementation IMO.
>
> Also - / 2?

We need to run a timer each 100ms.  For times look at the 0/6 patch.
We can't spent more that 50ms in each function.  It is something that
should happen for all funnctions called from io_handlers.

>> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
>
> Is this a debugging aid?

I left that on purpose, to show that it happens a lot.  There is no
DEBUG_ARCH or DEBUG_RAM around, I can create them if you preffer.  But
notice that this is something that shouldn't happen (but it happens).

DPRINTF for that file should be a good idea, will do.

>> +		break;
>> +	    }
>> +	}
>> +        i++;
>>      }
>> 
>>      t0 = qemu_get_clock_ns(rt_clock) - t0;
>> -- 
>> 1.7.3.2
>> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-24 11:01     ` Juan Quintela
@ 2010-11-24 11:14       ` Michael S. Tsirkin
  2010-11-24 15:16         ` Paolo Bonzini
  0 siblings, 1 reply; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-24 11:14 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel

On Wed, Nov 24, 2010 at 12:01:51PM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Wed, Nov 24, 2010 at 12:03:06AM +0100, Juan Quintela wrote:
> >> From: Juan Quintela <quintela@trasno.org>
> >> 
> >> cheking each 64 pages is a random magic number as good as any other.
> >> We don't want to test too many times, but on the other hand,
> >> qemu_get_clock_ns() is not so expensive either.
> >> 
> >
> > Could you please explain what's the problem this fixes?
> > I would like to see an API that documents the contract
> > we are making with the backend.
> 
> buffered_file is an "abstraction" that uses a buffer.
> 
> live migration code (remember it can't sleep, it runs on the main loop)
> stores its "stuff" on that buffer.  And a timer writes that buffer to
> the fd that is associated with migration.
> 
> This design is due to the main_loop/no threads qemu model.
> 
> buffered_file timer runs each 100ms.  And we "try" to measure channel
> bandwidth from there.  If we are not able to run the timer, all the
> calculations are wrong, and then stalls happens.

So the problem is the timer in the buffered file abstraction?
Why don't we just flush out data if the buffer is full?

> 
> 
> >> @@ -269,6 +272,19 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
> >>          if (bytes_sent == 0) { /* no more blocks */
> >>              break;
> >>          }
> >> +	/* we want to check in the 1st loop, just in case it was the 1st time
> >> +           and we had to sync the dirty bitmap.
> >> +           qemu_get_clock_ns() is a bit expensive, so we only check each some
> >> +           iterations
> >> +	*/
> >> +        if ((i & 63) == 0) {
> >> +            uint64_t t1 = (qemu_get_clock_ns(rt_clock) - t0) / 1000000;
> >
> > This adds even more non-determinism to savevm behaviour.  If bandwidth
> > limit is higth enough, I expect it to just keep going.
> 
> If we find a row of 512MB of zero pages together (and that happens if
> you have a 64GB iddle guest, then you can spent more than 3seconds to
> fill the default bandwith).  After that everything that uses the main
> loop has had stalls.
> 
> 
> >> +            if (t1 > buffered_file_interval/2) {
> >
> > arch_init should not depend on buffered_file implementation IMO.
> >
> > Also - / 2?
> 
> We need to run a timer each 100ms.  For times look at the 0/6 patch.
> We can't spent more that 50ms in each function.  It is something that
> should happen for all funnctions called from io_handlers.
> 
> >> +                printf("big delay %ld milliseconds, %d iterations\n", t1, i);
> >
> > Is this a debugging aid?
> 
> I left that on purpose, to show that it happens a lot.  There is no
> DEBUG_ARCH or DEBUG_RAM around, I can create them if you preffer.  But
> notice that this is something that shouldn't happen (but it happens).
> 
> DPRINTF for that file should be a good idea, will do.
> 
> >> +		break;
> >> +	    }
> >> +	}
> >> +        i++;
> >>      }
> >> 
> >>      t0 = qemu_get_clock_ns(rt_clock) - t0;
> >> -- 
> >> 1.7.3.2
> >> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-24 11:14       ` Michael S. Tsirkin
@ 2010-11-24 15:16         ` Paolo Bonzini
  2010-11-24 15:59           ` Michael S. Tsirkin
       [not found]           ` <4CF45E3F.4040609@codemonkey.ws>
  0 siblings, 2 replies; 75+ messages in thread
From: Paolo Bonzini @ 2010-11-24 15:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, Juan Quintela

On 11/24/2010 12:14 PM, Michael S. Tsirkin wrote:
>> >  buffered_file timer runs each 100ms.  And we "try" to measure channel
>> >  bandwidth from there.  If we are not able to run the timer, all the
>> >  calculations are wrong, and then stalls happens.
>
> So the problem is the timer in the buffered file abstraction?
> Why don't we just flush out data if the buffer is full?

It takes a lot to fill the buffer if you have many zero pages, and if 
that happens the guest is starved by the main loop filling the buffer.

Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-24 15:16         ` Paolo Bonzini
@ 2010-11-24 15:59           ` Michael S. Tsirkin
       [not found]           ` <4CF45E3F.4040609@codemonkey.ws>
  1 sibling, 0 replies; 75+ messages in thread
From: Michael S. Tsirkin @ 2010-11-24 15:59 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, Juan Quintela

On Wed, Nov 24, 2010 at 04:16:04PM +0100, Paolo Bonzini wrote:
> On 11/24/2010 12:14 PM, Michael S. Tsirkin wrote:
> >>>  buffered_file timer runs each 100ms.  And we "try" to measure channel
> >>>  bandwidth from there.  If we are not able to run the timer, all the
> >>>  calculations are wrong, and then stalls happens.
> >
> >So the problem is the timer in the buffered file abstraction?
> >Why don't we just flush out data if the buffer is full?
> 
> It takes a lot to fill the buffer if you have many zero pages, and
> if that happens the guest is starved by the main loop filling the
> buffer.
> 
> Paolo

To solve starvation we really should return to main loop
as soon as possible, give io a chance to run.
The timer based hack is probably the best we can do by 0.14, though.

-- 
MST

^ permalink raw reply	[flat|nested] 75+ messages in thread

[parent not found: <4CF45E3F.4040609@codemonkey.ws>]

* Re: [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
       [not found]           ` <4CF45E3F.4040609@codemonkey.ws>
@ 2010-11-30  8:10             ` Paolo Bonzini
  2010-11-30 13:26             ` Juan Quintela
  1 sibling, 0 replies; 75+ messages in thread
From: Paolo Bonzini @ 2010-11-30  8:10 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Juan Quintela, qemu-devel, Michael S. Tsirkin

On 11/30/2010 03:15 AM, Anthony Liguori wrote:
>
> Sounds like the sort of thing you'd only see if you created a guest a
> large guest that was mostly unallocated and then tried to migrate.

Or if you migrate a guest that was booted using the memory= option.

Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
       [not found]           ` <4CF45E3F.4040609@codemonkey.ws>
  2010-11-30  8:10             ` Paolo Bonzini
@ 2010-11-30 13:26             ` Juan Quintela
  1 sibling, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 13:26 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paolo Bonzini, qemu-devel, Michael S. Tsirkin

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/24/2010 09:16 AM, Paolo Bonzini wrote:
>> On 11/24/2010 12:14 PM, Michael S. Tsirkin wrote:
>>>> >  buffered_file timer runs each 100ms.  And we "try" to measure
>>>> channel
>>>> >  bandwidth from there.  If we are not able to run the timer, all the
>>>> >  calculations are wrong, and then stalls happens.
>>>
>>> So the problem is the timer in the buffered file abstraction?
>>> Why don't we just flush out data if the buffer is full?
>>
>> It takes a lot to fill the buffer if you have many zero pages, and
>> if that happens the guest is starved by the main loop filling the
>> buffer.
>
> Sounds like the sort of thing you'd only see if you created a guest a
> large guest that was mostly unallocated and then tried to migrate.
> That doesn't seem like a very real case to me though.

No.  this is the "easy" to reproduce case.  You can get that in normal
use.  Just with an idle guest with loads of memory is the worst possible
case, and trivial to reproduce.

> The best approach would be to drop qemu_mutex while processing this
> code instead of having an arbitrary back-off point.  The later is
> deferring the problem to another day when it becomes the source of a
> future problem.

As told in the other mail, you are offering me half a solution.  If I
implemente the qemu_mutex change (that I will do) we would still have
this problem on the main loop.  CPU stuck for 10s will be done, but
nothing else.

Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> Paolo
>>

^ permalink raw reply	[flat|nested] 75+ messages in thread

[parent not found: <4CF45D67.5010906@codemonkey.ws>]

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
       [not found]   ` <4CF45D67.5010906@codemonkey.ws>
@ 2010-11-30  7:15     ` Paolo Bonzini
  2010-11-30 13:47       ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Paolo Bonzini @ 2010-11-30  7:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela, Juan Quintela

On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>
> BufferedFile should hit the qemu_file_rate_limit check when the socket
> buffer gets filled up.

The problem is that the file rate limit is not hit because work is done 
elsewhere.  The rate can limit the bandwidth used and makes QEMU aware 
that socket operations may block (because that's what the buffered file 
freeze/unfreeze logic does); but it cannot be used to limit the _time_ 
spent in the migration code.

Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30  7:15     ` Paolo Bonzini
@ 2010-11-30 13:47       ` Anthony Liguori
  2010-11-30 13:58         ` Avi Kivity
  2010-11-30 14:12         ` Paolo Bonzini
  0 siblings, 2 replies; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 13:47 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Juan Quintela, qemu-devel, kvm-devel, Juan Quintela

On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>
>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>> buffer gets filled up.
>
> The problem is that the file rate limit is not hit because work is 
> done elsewhere.  The rate can limit the bandwidth used and makes QEMU 
> aware that socket operations may block (because that's what the 
> buffered file freeze/unfreeze logic does); but it cannot be used to 
> limit the _time_ spent in the migration code.

Yes, it can, if you set the rate limit sufficiently low.

The caveats are 1) the kvm.ko interface for dirty bits doesn't scale for 
large memory guests so we spend a lot more CPU time walking it than we 
should 2) zero pages cause us to burn a lot more CPU time than we 
otherwise would because compressing them is so effective.

In the short term, fixing (2) by accounting zero pages as full sized 
pages should "fix" the problem.

In the long term, we need a new dirty bit interface from kvm.ko that 
uses a multi-level table.  That should dramatically improve scan 
performance.  We also need to implement live migration in a separate 
thread that doesn't carry qemu_mutex while it runs.

Regards,

Anthony Liguori

> Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 13:47       ` Anthony Liguori
@ 2010-11-30 13:58         ` Avi Kivity
  2010-11-30 14:17           ` Anthony Liguori
  2010-11-30 14:12         ` Paolo Bonzini
  1 sibling, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2010-11-30 13:58 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paolo Bonzini, Juan Quintela, qemu-devel, kvm-devel,
	Juan Quintela

On 11/30/2010 03:47 PM, Anthony Liguori wrote:
> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>
>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>> buffer gets filled up.
>>
>> The problem is that the file rate limit is not hit because work is 
>> done elsewhere.  The rate can limit the bandwidth used and makes QEMU 
>> aware that socket operations may block (because that's what the 
>> buffered file freeze/unfreeze logic does); but it cannot be used to 
>> limit the _time_ spent in the migration code.
>
> Yes, it can, if you set the rate limit sufficiently low.
>
> The caveats are 1) the kvm.ko interface for dirty bits doesn't scale 
> for large memory guests so we spend a lot more CPU time walking it 
> than we should 2) zero pages cause us to burn a lot more CPU time than 
> we otherwise would because compressing them is so effective.

What's the problem with burning that cpu?  per guest page, compressing 
takes less than sending.  Is it just an issue of qemu mutex hold time?

>
> In the short term, fixing (2) by accounting zero pages as full sized 
> pages should "fix" the problem.
>
> In the long term, we need a new dirty bit interface from kvm.ko that 
> uses a multi-level table.  That should dramatically improve scan 
> performance. 

Why would a multi-level table help?  (or rather, please explain what you 
mean by a multi-level table).

Something we could do is divide memory into more slots, and polling each 
slot when we start to scan its page range.  That reduces the time 
between sampling a page's dirtiness and sending it off, and reduces the 
latency incurred by the sampling.  There are also non-interface-changing 
ways to reduce this latency, like O(1) write protection, or using dirty 
bits instead of write protection when available.

> We also need to implement live migration in a separate thread that 
> doesn't carry qemu_mutex while it runs.

IMO that's the biggest hit currently.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 13:58         ` Avi Kivity
@ 2010-11-30 14:17           ` Anthony Liguori
  2010-11-30 14:27             ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 14:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Paolo Bonzini, Juan Quintela, qemu-devel, kvm-devel,
	Juan Quintela

On 11/30/2010 07:58 AM, Avi Kivity wrote:
> On 11/30/2010 03:47 PM, Anthony Liguori wrote:
>> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>>
>>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>>> buffer gets filled up.
>>>
>>> The problem is that the file rate limit is not hit because work is 
>>> done elsewhere.  The rate can limit the bandwidth used and makes 
>>> QEMU aware that socket operations may block (because that's what the 
>>> buffered file freeze/unfreeze logic does); but it cannot be used to 
>>> limit the _time_ spent in the migration code.
>>
>> Yes, it can, if you set the rate limit sufficiently low.
>>
>> The caveats are 1) the kvm.ko interface for dirty bits doesn't scale 
>> for large memory guests so we spend a lot more CPU time walking it 
>> than we should 2) zero pages cause us to burn a lot more CPU time 
>> than we otherwise would because compressing them is so effective.
>
> What's the problem with burning that cpu?  per guest page, compressing 
> takes less than sending.  Is it just an issue of qemu mutex hold time?

If you have a 512GB guest, then you have a 16MB dirty bitmap which ends 
up being an 128MB dirty bitmap in QEMU because we represent dirty bits 
with 8 bits.

Walking 16mb (or 128mb) of memory just fine find a few pages to send 
over the wire is a big waste of CPU time.  If kvm.ko used a multi-level 
table to represent dirty info, we could walk the memory mapping at 2MB 
chunks allowing us to skip a large amount of the comparisons.

>> In the short term, fixing (2) by accounting zero pages as full sized 
>> pages should "fix" the problem.
>>
>> In the long term, we need a new dirty bit interface from kvm.ko that 
>> uses a multi-level table.  That should dramatically improve scan 
>> performance. 
>
> Why would a multi-level table help?  (or rather, please explain what 
> you mean by a multi-level table).
>
> Something we could do is divide memory into more slots, and polling 
> each slot when we start to scan its page range.  That reduces the time 
> between sampling a page's dirtiness and sending it off, and reduces 
> the latency incurred by the sampling.  There are also 
> non-interface-changing ways to reduce this latency, like O(1) write 
> protection, or using dirty bits instead of write protection when 
> available.

BTW, we should also refactor qemu to use the kvm dirty bitmap directly 
instead of mapping it to the main dirty bitmap.

>> We also need to implement live migration in a separate thread that 
>> doesn't carry qemu_mutex while it runs.
>
> IMO that's the biggest hit currently.

Yup.  That's the Correct solution to the problem.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 14:17           ` Anthony Liguori
@ 2010-11-30 14:27             ` Avi Kivity
  2010-11-30 14:50               ` Anthony Liguori
                                 ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Avi Kivity @ 2010-11-30 14:27 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paolo Bonzini, Juan Quintela, qemu-devel, kvm-devel,
	Juan Quintela

On 11/30/2010 04:17 PM, Anthony Liguori wrote:
>> What's the problem with burning that cpu?  per guest page, 
>> compressing takes less than sending.  Is it just an issue of qemu 
>> mutex hold time?
>
>
> If you have a 512GB guest, then you have a 16MB dirty bitmap which 
> ends up being an 128MB dirty bitmap in QEMU because we represent dirty 
> bits with 8 bits.

Was there not a patchset to split each bit into its own bitmap?  And 
then copy the kvm or qemu master bitmap into each client bitmap as it 
became needed?

> Walking 16mb (or 128mb) of memory just fine find a few pages to send 
> over the wire is a big waste of CPU time.  If kvm.ko used a 
> multi-level table to represent dirty info, we could walk the memory 
> mapping at 2MB chunks allowing us to skip a large amount of the 
> comparisons.

There's no reason to assume dirty pages would be clustered.  If 0.2% of 
memory were dirty, but scattered uniformly, there would be no win from 
the two-level bitmap.  A loss, in fact: 2MB can be represented as 512 
bits or 64 bytes, just one cache line.  Any two-level thing will need more.

We might have a more compact encoding for sparse bitmaps, like 
run-length encoding.

>
>>> In the short term, fixing (2) by accounting zero pages as full sized 
>>> pages should "fix" the problem.
>>>
>>> In the long term, we need a new dirty bit interface from kvm.ko that 
>>> uses a multi-level table.  That should dramatically improve scan 
>>> performance. 
>>
>> Why would a multi-level table help?  (or rather, please explain what 
>> you mean by a multi-level table).
>>
>> Something we could do is divide memory into more slots, and polling 
>> each slot when we start to scan its page range.  That reduces the 
>> time between sampling a page's dirtiness and sending it off, and 
>> reduces the latency incurred by the sampling.  There are also 
>> non-interface-changing ways to reduce this latency, like O(1) write 
>> protection, or using dirty bits instead of write protection when 
>> available.
>
> BTW, we should also refactor qemu to use the kvm dirty bitmap directly 
> instead of mapping it to the main dirty bitmap.

That's what the patch set I was alluding to did.  Or maybe I imagined 
the whole thing.

>>> We also need to implement live migration in a separate thread that 
>>> doesn't carry qemu_mutex while it runs.
>>
>> IMO that's the biggest hit currently.
>
> Yup.  That's the Correct solution to the problem.

Then let's just Do it.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 14:27             ` Avi Kivity
@ 2010-11-30 14:50               ` Anthony Liguori
  2010-12-01 12:40                 ` Avi Kivity
  2010-11-30 17:43               ` Juan Quintela
  2010-12-01  1:20               ` Takuya Yoshikawa
  2 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 14:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Paolo Bonzini, Juan Quintela, qemu-devel, kvm-devel,
	Juan Quintela

On 11/30/2010 08:27 AM, Avi Kivity wrote:
> On 11/30/2010 04:17 PM, Anthony Liguori wrote:
>>> What's the problem with burning that cpu?  per guest page, 
>>> compressing takes less than sending.  Is it just an issue of qemu 
>>> mutex hold time?
>>
>>
>> If you have a 512GB guest, then you have a 16MB dirty bitmap which 
>> ends up being an 128MB dirty bitmap in QEMU because we represent 
>> dirty bits with 8 bits.
>
> Was there not a patchset to split each bit into its own bitmap?  And 
> then copy the kvm or qemu master bitmap into each client bitmap as it 
> became needed?
>
>> Walking 16mb (or 128mb) of memory just fine find a few pages to send 
>> over the wire is a big waste of CPU time.  If kvm.ko used a 
>> multi-level table to represent dirty info, we could walk the memory 
>> mapping at 2MB chunks allowing us to skip a large amount of the 
>> comparisons.
>
> There's no reason to assume dirty pages would be clustered.  If 0.2% 
> of memory were dirty, but scattered uniformly, there would be no win 
> from the two-level bitmap.  A loss, in fact: 2MB can be represented as 
> 512 bits or 64 bytes, just one cache line.  Any two-level thing will 
> need more.
>
> We might have a more compact encoding for sparse bitmaps, like 
> run-length encoding.
>
>>
>>>> In the short term, fixing (2) by accounting zero pages as full 
>>>> sized pages should "fix" the problem.
>>>>
>>>> In the long term, we need a new dirty bit interface from kvm.ko 
>>>> that uses a multi-level table.  That should dramatically improve 
>>>> scan performance. 
>>>
>>> Why would a multi-level table help?  (or rather, please explain what 
>>> you mean by a multi-level table).
>>>
>>> Something we could do is divide memory into more slots, and polling 
>>> each slot when we start to scan its page range.  That reduces the 
>>> time between sampling a page's dirtiness and sending it off, and 
>>> reduces the latency incurred by the sampling.  There are also 
>>> non-interface-changing ways to reduce this latency, like O(1) write 
>>> protection, or using dirty bits instead of write protection when 
>>> available.
>>
>> BTW, we should also refactor qemu to use the kvm dirty bitmap 
>> directly instead of mapping it to the main dirty bitmap.
>
> That's what the patch set I was alluding to did.  Or maybe I imagined 
> the whole thing.

No, it just split the main bitmap into three bitmaps.  I'm suggesting 
that we have the dirty interface have two implementations, one that 
refers to the 8-bit bitmap when TCG in use and another one that uses the 
KVM representation.

TCG really needs multiple dirty bits but KVM doesn't.  A shared 
implementation really can't be optimal.

>
>>>> We also need to implement live migration in a separate thread that 
>>>> doesn't carry qemu_mutex while it runs.
>>>
>>> IMO that's the biggest hit currently.
>>
>> Yup.  That's the Correct solution to the problem.
>
> Then let's just Do it.
>

Yup.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 14:50               ` Anthony Liguori
@ 2010-12-01 12:40                 ` Avi Kivity
  0 siblings, 0 replies; 75+ messages in thread
From: Avi Kivity @ 2010-12-01 12:40 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paolo Bonzini, Juan Quintela, qemu-devel, kvm-devel,
	Juan Quintela

On 11/30/2010 04:50 PM, Anthony Liguori wrote:
>> That's what the patch set I was alluding to did.  Or maybe I imagined 
>> the whole thing.
>
>
> No, it just split the main bitmap into three bitmaps.  I'm suggesting 
> that we have the dirty interface have two implementations, one that 
> refers to the 8-bit bitmap when TCG in use and another one that uses 
> the KVM representation.
>
> TCG really needs multiple dirty bits but KVM doesn't.  A shared 
> implementation really can't be optimal.

Live migration and the framebuffer can certainly share code with kvm and 
tcg:

- tcg or kvm maintain an internal bitmap (kvm in the kernel, tcg updates 
a private bitmap)
- a dirty log client wants to see an updated bitmap; migration on a new 
pass, vga on screen refresh
    - ask the producer (kvm or tcg) to fetch-and-clear a dirty bitmap
    - broadcast it ( |= ) into any active clients (migration or framebuffer)
- everyone's happy

The code dirty thing might need special treatment, we can have a special 
tcg-only bitmap for it.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 14:27             ` Avi Kivity
  2010-11-30 14:50               ` Anthony Liguori
@ 2010-11-30 17:43               ` Juan Quintela
  2010-12-01  1:20               ` Takuya Yoshikawa
  2 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 17:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Paolo Bonzini, qemu-devel, kvm-devel

Avi Kivity <avi@redhat.com> wrote:
> On 11/30/2010 04:17 PM, Anthony Liguori wrote:
>>> What's the problem with burning that cpu?  per guest page,
>>> compressing takes less than sending.  Is it just an issue of qemu
>>> mutex hold time?
>>
>>
>> If you have a 512GB guest, then you have a 16MB dirty bitmap which
>> ends up being an 128MB dirty bitmap in QEMU because we represent
>> dirty bits with 8 bits.
>
> Was there not a patchset to split each bit into its own bitmap?  And
> then copy the kvm or qemu master bitmap into each client bitmap as it
> became needed?
>
>> Walking 16mb (or 128mb) of memory just fine find a few pages to send
>> over the wire is a big waste of CPU time.  If kvm.ko used a
>> multi-level table to represent dirty info, we could walk the memory
>> mapping at 2MB chunks allowing us to skip a large amount of the
>> comparisons.
>
> There's no reason to assume dirty pages would be clustered.  If 0.2%
> of memory were dirty, but scattered uniformly, there would be no win
> from the two-level bitmap.  A loss, in fact: 2MB can be represented as
> 512 bits or 64 bytes, just one cache line.  Any two-level thing will
> need more.
>
> We might have a more compact encoding for sparse bitmaps, like
> run-length encoding.


I haven't measured it, but I think that it would be much better that
way.  When we start, it don't matter too much (everything is dirty),
what we should optimize for is the last rounds, and in the last rounds
it would be much better to ask kvm:

fill this array of dirty pages offsets, and be done with it.
Not sure if adding a size field would improve things, both tests need to
be measured.

What would be a winner independent of that is a way to ask qemu the
number of dirty pages.  Just now we need to calculate them walking the
bitmap (one of my patches just simplifies this).

Adding the feature to qemu means that we could always give recent
information to "info migrate" without incurring in a big cost.

>> BTW, we should also refactor qemu to use the kvm dirty bitmap
>> directly instead of mapping it to the main dirty bitmap.
>
> That's what the patch set I was alluding to did.  Or maybe I imagined
> the whole thing.

it existed.  And today would be easier because KQEMU and VGA are not
needed  anymore.

>>>> We also need to implement live migration in a separate thread that
>>>> doesn't carry qemu_mutex while it runs.
>>>
>>> IMO that's the biggest hit currently.
>>
>> Yup.  That's the Correct solution to the problem.
>
> Then let's just Do it.

Will take a look at splittingthe qemu_mutex bit.

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 14:27             ` Avi Kivity
  2010-11-30 14:50               ` Anthony Liguori
  2010-11-30 17:43               ` Juan Quintela
@ 2010-12-01  1:20               ` Takuya Yoshikawa
  2010-12-01  1:52                 ` Juan Quintela
  2 siblings, 1 reply; 75+ messages in thread
From: Takuya Yoshikawa @ 2010-12-01  1:20 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm-devel, Juan Quintela, qemu-devel, Juan Quintela,
	Paolo Bonzini

On Tue, 30 Nov 2010 16:27:13 +0200
Avi Kivity <avi@redhat.com> wrote:

> On 11/30/2010 04:17 PM, Anthony Liguori wrote:
> >> What's the problem with burning that cpu?  per guest page, 
> >> compressing takes less than sending.  Is it just an issue of qemu 
> >> mutex hold time?
> >
> >
> > If you have a 512GB guest, then you have a 16MB dirty bitmap which 
> > ends up being an 128MB dirty bitmap in QEMU because we represent dirty 
> > bits with 8 bits.
> 
> Was there not a patchset to split each bit into its own bitmap?  And 
> then copy the kvm or qemu master bitmap into each client bitmap as it 
> became needed?
> 
> > Walking 16mb (or 128mb) of memory just fine find a few pages to send 
> > over the wire is a big waste of CPU time.  If kvm.ko used a 
> > multi-level table to represent dirty info, we could walk the memory 
> > mapping at 2MB chunks allowing us to skip a large amount of the 
> > comparisons.
> 
> There's no reason to assume dirty pages would be clustered.  If 0.2% of 
> memory were dirty, but scattered uniformly, there would be no win from 
> the two-level bitmap.  A loss, in fact: 2MB can be represented as 512 
> bits or 64 bytes, just one cache line.  Any two-level thing will need more.
> 
> We might have a more compact encoding for sparse bitmaps, like 
> run-length encoding.
> 

Does anyone is profiling these dirty bitmap things?

 - 512GB guest is really the target?
 - how much cpu time can we use for these things?
 - how many dirty pages do we have to care?

Since we are planning to do some profiling for these, taking into account
Kemari, can you please share these information?


> >
> >>> In the short term, fixing (2) by accounting zero pages as full sized 
> >>> pages should "fix" the problem.
> >>>
> >>> In the long term, we need a new dirty bit interface from kvm.ko that 
> >>> uses a multi-level table.  That should dramatically improve scan 
> >>> performance. 
> >>
> >> Why would a multi-level table help?  (or rather, please explain what 
> >> you mean by a multi-level table).
> >>
> >> Something we could do is divide memory into more slots, and polling 
> >> each slot when we start to scan its page range.  That reduces the 
> >> time between sampling a page's dirtiness and sending it off, and 
> >> reduces the latency incurred by the sampling.  There are also 

If we use rmap approach with one more interface, we can specify which
range of dirty bitmap to get. This has the same effect to splitting into
more slots.


> >> non-interface-changing ways to reduce this latency, like O(1) write 
> >> protection, or using dirty bits instead of write protection when 
> >> available.

IIUC, O(1) will lazily write protect pages beggining from top level?
Does this have any impact other than the timing of get_dirty_log()?


Thanks,
  Takuya


> >
> > BTW, we should also refactor qemu to use the kvm dirty bitmap directly 
> > instead of mapping it to the main dirty bitmap.
> 
> That's what the patch set I was alluding to did.  Or maybe I imagined 
> the whole thing.
> 
> >>> We also need to implement live migration in a separate thread that 
> >>> doesn't carry qemu_mutex while it runs.
> >>
> >> IMO that's the biggest hit currently.
> >
> > Yup.  That's the Correct solution to the problem.
> 
> Then let's just Do it.
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-12-01  1:20               ` Takuya Yoshikawa
@ 2010-12-01  1:52                 ` Juan Quintela
  2010-12-01  2:22                   ` Takuya Yoshikawa
  2010-12-01 12:35                   ` Avi Kivity
  0 siblings, 2 replies; 75+ messages in thread
From: Juan Quintela @ 2010-12-01  1:52 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Paolo Bonzini, kvm-devel, Avi Kivity, qemu-devel

Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> wrote:
> On Tue, 30 Nov 2010 16:27:13 +0200
> Avi Kivity <avi@redhat.com> wrote:

> Does anyone is profiling these dirty bitmap things?

I am.

>  - 512GB guest is really the target?

no, problems exist with smaller amounts of RAM.  with 16GB guest it is
trivial to get 1s stalls, 64GB guest, 3-4s, with more memory, migration
is flaky to say the less.

>  - how much cpu time can we use for these things?

the problem here is that we are forced to walk the bitmap too many
times, we want to do it less times.

>  - how many dirty pages do we have to care?

default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
dirty pages to have only 30ms of downtime.

But notice that this is what we are advertising, we aren't near there at all.

> Since we are planning to do some profiling for these, taking into account
> Kemari, can you please share these information?

If you see the 0/10 email with this setup, you can see how much time are
we spending on stuff.  Just now (for migration, kemari is a bit
different) we have to fix other things first.

Next item for me is te improve that bitmap handling (we can at least
trivial divide the space used by 8, and use ffs to find dirty pages).

Thinknig about changing kvm interface when we convert it to the
bottleneck (it is not at this point).

>> >>> In the short term, fixing (2) by accounting zero pages as full sized 
>> >>> pages should "fix" the problem.
>> >>>
>> >>> In the long term, we need a new dirty bit interface from kvm.ko that 
>> >>> uses a multi-level table.  That should dramatically improve scan 
>> >>> performance. 
>> >>
>> >> Why would a multi-level table help?  (or rather, please explain what 
>> >> you mean by a multi-level table).
>> >>
>> >> Something we could do is divide memory into more slots, and polling 
>> >> each slot when we start to scan its page range.  That reduces the 
>> >> time between sampling a page's dirtiness and sending it off, and 
>> >> reduces the latency incurred by the sampling.  There are also 
>
> If we use rmap approach with one more interface, we can specify which
> range of dirty bitmap to get. This has the same effect to splitting into
> more slots.

kvm allows us to that today.  It is qemu who don't use this
information.  qemu always asks for the whole memory.  kvm is happy to
give only a range.

>> >> non-interface-changing ways to reduce this latency, like O(1) write 
>> >> protection, or using dirty bits instead of write protection when 
>> >> available.
>
> IIUC, O(1) will lazily write protect pages beggining from top level?
> Does this have any impact other than the timing of get_dirty_log()?

dunno.

At this point I am trying to:
- get migration with 16-64GB to not having stalls.
- get infrastructure to be able to know what is going on.

So far, bigger stalls are gone, and discusing what to do next.  As
Anthony suggested I run ram_save_live() loop without qemu_mutex, and now
guests get much better interaction, but my current patch (for this) is
just to put qemu_mutex_unlock_iothread()/qemu_mutex_lock_iothread()
around it.  I think that we are racey with teh access to the bitmap, but
it was just a test.

With respect to Kemari, we can discuss what do you need and how you are
going to test, just to not do overlapping work.

Thanks, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-12-01  1:52                 ` Juan Quintela
@ 2010-12-01  2:22                   ` Takuya Yoshikawa
  2010-12-01 12:35                   ` Avi Kivity
  1 sibling, 0 replies; 75+ messages in thread
From: Takuya Yoshikawa @ 2010-12-01  2:22 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Paolo Bonzini, kvm-devel, Avi Kivity, qemu-devel

On Wed, 01 Dec 2010 02:52:08 +0100
Juan Quintela <quintela@redhat.com> wrote:

> > Since we are planning to do some profiling for these, taking into account
> > Kemari, can you please share these information?
> 
> If you see the 0/10 email with this setup, you can see how much time are
> we spending on stuff.  Just now (for migration, kemari is a bit
> different) we have to fix other things first.
> 

Thank you for the information.
  Sorry, I only had the [9/10] in my kvm mail box and did not notice this [0/10].

> >> >> non-interface-changing ways to reduce this latency, like O(1) write 
> >> >> protection, or using dirty bits instead of write protection when 
> >> >> available.
> >
> > IIUC, O(1) will lazily write protect pages beggining from top level?
> > Does this have any impact other than the timing of get_dirty_log()?
> 
> dunno.
> 
> At this point I am trying to:
> - get migration with 16-64GB to not having stalls.
> - get infrastructure to be able to know what is going on.
> 
> So far, bigger stalls are gone, and discusing what to do next.  As
> Anthony suggested I run ram_save_live() loop without qemu_mutex, and now
> guests get much better interaction, but my current patch (for this) is
> just to put qemu_mutex_unlock_iothread()/qemu_mutex_lock_iothread()
> around it.  I think that we are racey with teh access to the bitmap, but
> it was just a test.
> 
> With respect to Kemari, we can discuss what do you need and how you are
> going to test, just to not do overlapping work.

I see, we've just started to talk about what we have to achieve and what we
can expect. I need to talk with Tamura-san about this a bit more before
showing some example targets.

Thanks,
  Takuya

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-12-01  1:52                 ` Juan Quintela
  2010-12-01  2:22                   ` Takuya Yoshikawa
@ 2010-12-01 12:35                   ` Avi Kivity
  2010-12-01 13:45                     ` Juan Quintela
  2010-12-02  1:31                     ` Takuya Yoshikawa
  1 sibling, 2 replies; 75+ messages in thread
From: Avi Kivity @ 2010-12-01 12:35 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Takuya Yoshikawa, Paolo Bonzini, qemu-devel, kvm-devel

On 12/01/2010 03:52 AM, Juan Quintela wrote:
> >   - 512GB guest is really the target?
>
> no, problems exist with smaller amounts of RAM.  with 16GB guest it is
> trivial to get 1s stalls, 64GB guest, 3-4s, with more memory, migration
> is flaky to say the less.
>
> >   - how much cpu time can we use for these things?
>
> the problem here is that we are forced to walk the bitmap too many
> times, we want to do it less times.

How much time is spent walking bitmaps?  Are you sure this is the problem?

> >   - how many dirty pages do we have to care?
>
> default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> dirty pages to have only 30ms of downtime.

1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-12-01 12:35                   ` Avi Kivity
@ 2010-12-01 13:45                     ` Juan Quintela
  2010-12-02  1:31                     ` Takuya Yoshikawa
  1 sibling, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-12-01 13:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Takuya Yoshikawa, Paolo Bonzini, qemu-devel, kvm-devel

Avi Kivity <avi@redhat.com> wrote:
> On 12/01/2010 03:52 AM, Juan Quintela wrote:
>> >   - 512GB guest is really the target?
>>
>> no, problems exist with smaller amounts of RAM.  with 16GB guest it is
>> trivial to get 1s stalls, 64GB guest, 3-4s, with more memory, migration
>> is flaky to say the less.
>>
>> >   - how much cpu time can we use for these things?
>>
>> the problem here is that we are forced to walk the bitmap too many
>> times, we want to do it less times.
>
> How much time is spent walking bitmaps?  Are you sure this is the problem?

see my 10/10 patch, that one makes ram_save_live() to move from 80% of the
time in the profile to the second one with ~5% or so.

The important bit is:

 static uint64_t ram_save_remaining(void)
 {
-    RAMBlock *block;
-    uint64_t count = 0;
-
-    QLIST_FOREACH(block, &ram_list.blocks, next) {
-        ram_addr_t addr;
-        for (addr = block->offset; addr < block->offset + block->length;
-             addr += TARGET_PAGE_SIZE) {
-            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
-                count++;
-            }
-        }
-    }
-
-    return count;
+    return ram_list.dirty_pages;
 }


We dont' need to walk all bitmap to see how much memory is remaining.

syncing the bitmap is also expensive, but just now we have other issues
that hide it.  That is the reason that I didn't started trynig to get a
better interface with kvm, 1st I will try to remove the bigger
bottlenecks.

>> >   - how many dirty pages do we have to care?
>>
>> default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
>> dirty pages to have only 30ms of downtime.
>
> 1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.

I will learn to make math with time and bytes at some point O:-)

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-12-01 12:35                   ` Avi Kivity
  2010-12-01 13:45                     ` Juan Quintela
@ 2010-12-02  1:31                     ` Takuya Yoshikawa
  2010-12-02  8:37                       ` Avi Kivity
  1 sibling, 1 reply; 75+ messages in thread
From: Takuya Yoshikawa @ 2010-12-02  1:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Paolo Bonzini, kvm-devel, qemu-devel, Juan Quintela

Thanks for the answers Avi, Juan,

Some FYI, (not about the bottleneck)

On Wed, 01 Dec 2010 14:35:57 +0200
Avi Kivity <avi@redhat.com> wrote:

> > >   - how many dirty pages do we have to care?
> >
> > default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> > dirty pages to have only 30ms of downtime.
> 
> 1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.
> 

3MB / 4KB/page = 750 pages.

Then, KVM side processing is near the theoretical goal!

In my framebuffer test, I tested

  nr_dirty_pages/npages = 576/4096

case with the rate of 20 updates/s (1updates/50ms).

Using rmap optimization, write protection only took 46,718 tsc time.
Bitmap copy was not a problem of course.

The display was working anyway at this rate!

In my guess, within 1,000 dirty pages, kvm_vm_ioctl_get_dirty_log()
can be processed within 200us or so even for large RAM slot.
 - rmap optimization depends mainly on nr_dirty_pages but npages.

Avi, can you guess the property of O(1) write protection?
I want to test rmap optimization taking these issues into acount.

Of course, Kemari have to continue synchronization, and maybe see
more dirty pages. This will be a future task!

Thanks,
  Takuya

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-12-02  1:31                     ` Takuya Yoshikawa
@ 2010-12-02  8:37                       ` Avi Kivity
  0 siblings, 0 replies; 75+ messages in thread
From: Avi Kivity @ 2010-12-02  8:37 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Paolo Bonzini, kvm-devel, qemu-devel, Juan Quintela

On 12/02/2010 03:31 AM, Takuya Yoshikawa wrote:
> Thanks for the answers Avi, Juan,
>
> Some FYI, (not about the bottleneck)
>
> On Wed, 01 Dec 2010 14:35:57 +0200
> Avi Kivity<avi@redhat.com>  wrote:
>
> >  >  >    - how many dirty pages do we have to care?
> >  >
> >  >  default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> >  >  dirty pages to have only 30ms of downtime.
> >
> >  1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.
> >
>
> 3MB / 4KB/page = 750 pages.
>
> Then, KVM side processing is near the theoretical goal!
>
> In my framebuffer test, I tested
>
>    nr_dirty_pages/npages = 576/4096
>
> case with the rate of 20 updates/s (1updates/50ms).
>
> Using rmap optimization, write protection only took 46,718 tsc time.

Yes, using rmap to drive write protection with sparse dirty bitmaps 
really helps.

> Bitmap copy was not a problem of course.
>
> The display was working anyway at this rate!
>
>
> In my guess, within 1,000 dirty pages, kvm_vm_ioctl_get_dirty_log()
> can be processed within 200us or so even for large RAM slot.
>   - rmap optimization depends mainly on nr_dirty_pages but npages.
>
> Avi, can you guess the property of O(1) write protection?
> I want to test rmap optimization taking these issues into acount.

I think we should use O(1) write protection only if there is a large 
number of dirty pages.  With a small number, using rmap guided by the 
previous dirty bitmap is faster.

So, under normal operation where only the framebuffer is logged, we'd 
use rmap write protection; when enabling logging for live migration we'd 
use O(1) write protection, after a few iterations when the number of 
dirty pages drops, we switch back to rmap write protection.

> Of course, Kemari have to continue synchronization, and maybe see
> more dirty pages. This will be a future task!
>

There's yet another option, of using dirty bits instead of write 
protection.  Or maybe using write protection in the upper page tables 
and dirty bits in the lowest level.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 13:47       ` Anthony Liguori
  2010-11-30 13:58         ` Avi Kivity
@ 2010-11-30 14:12         ` Paolo Bonzini
  2010-11-30 15:00           ` Anthony Liguori
  1 sibling, 1 reply; 75+ messages in thread
From: Paolo Bonzini @ 2010-11-30 14:12 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Juan Quintela, qemu-devel, kvm-devel, Juan Quintela

On 11/30/2010 02:47 PM, Anthony Liguori wrote:
> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>
>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>> buffer gets filled up.
>>
>> The problem is that the file rate limit is not hit because work is
>> done elsewhere. The rate can limit the bandwidth used and makes QEMU
>> aware that socket operations may block (because that's what the
>> buffered file freeze/unfreeze logic does); but it cannot be used to
>> limit the _time_ spent in the migration code.
>
> Yes, it can, if you set the rate limit sufficiently low.

You mean, just like you can drive a car without brakes by keeping the 
speed sufficiently low.

> [..] accounting zero pages as full sized
> pages should "fix" the problem.

I know you used quotes, but it's a very very generous definition of fix. 
  Both these proposed "fixes" are nothing more than workarounds, and 
even particularly ugly ones.  The worst thing about them is that there 
is no guarantee of migration finishing in a reasonable time, or at all.

If you account zero pages as full, you don't use effectively the 
bandwidth that was allotted to you, you use only 0.2% of it (8/4096). 
It then takes an exaggerate amount of time to start iteration on pages 
that matter.  If you set the bandwidth low, instead, you do not have the 
bandwidth you need in order to converge.

Even from an aesthetic point of view, if there is such a thing, I don't 
understand why you advocate conflating network bandwidth and CPU usage 
into a single measurement.  Nobody disagrees that all you propose is 
nice to have, and that what Juan sent is a stopgap measure (though a 
very effective one).  However, this doesn't negate that Juan's 
accounting patches make a lot of sense in the current design.

> In the long term, we need a new dirty bit interface from kvm.ko that
> uses a multi-level table. That should dramatically improve scan
> performance. We also need to implement live migration in a separate
> thread that doesn't carry qemu_mutex while it runs.

This may be a good way to fix it, but it's also basically a rewrite.

Paolo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 14:12         ` Paolo Bonzini
@ 2010-11-30 15:00           ` Anthony Liguori
  2010-11-30 17:59             ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-11-30 15:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Juan Quintela, qemu-devel, kvm-devel, Juan Quintela

On 11/30/2010 08:12 AM, Paolo Bonzini wrote:
> On 11/30/2010 02:47 PM, Anthony Liguori wrote:
>> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:
>>> On 11/30/2010 03:11 AM, Anthony Liguori wrote:
>>>>
>>>> BufferedFile should hit the qemu_file_rate_limit check when the socket
>>>> buffer gets filled up.
>>>
>>> The problem is that the file rate limit is not hit because work is
>>> done elsewhere. The rate can limit the bandwidth used and makes QEMU
>>> aware that socket operations may block (because that's what the
>>> buffered file freeze/unfreeze logic does); but it cannot be used to
>>> limit the _time_ spent in the migration code.
>>
>> Yes, it can, if you set the rate limit sufficiently low.
>
> You mean, just like you can drive a car without brakes by keeping the 
> speed sufficiently low.
>
>> [..] accounting zero pages as full sized
>> pages should "fix" the problem.
>
> I know you used quotes, but it's a very very generous definition of 
> fix.  Both these proposed "fixes" are nothing more than workarounds, 
> and even particularly ugly ones.  The worst thing about them is that 
> there is no guarantee of migration finishing in a reasonable time, or 
> at all.
>
> If you account zero pages as full, you don't use effectively the 
> bandwidth that was allotted to you, you use only 0.2% of it (8/4096). 
> It then takes an exaggerate amount of time to start iteration on pages 
> that matter.  If you set the bandwidth low, instead, you do not have 
> the bandwidth you need in order to converge.
>
> Even from an aesthetic point of view, if there is such a thing, I 
> don't understand why you advocate conflating network bandwidth and CPU 
> usage into a single measurement.  Nobody disagrees that all you 
> propose is nice to have, and that what Juan sent is a stopgap measure 
> (though a very effective one).  However, this doesn't negate that 
> Juan's accounting patches make a lot of sense in the current design.

Juan's patch, IIUC, does the following: If you've been iterating in a 
tight loop, return to the main loop for *one* iteration every 50ms.

But this means that during this 50ms period of time, a VCPU may be 
blocked from running.  If the guest isn't doing a lot of device I/O 
*and* you're on a relatively low link speed, then this will mean that 
you don't hold qemu_mutex for more than 50ms at a time.

But in the degenerate case where you have a high speed link and you have 
a guest doing a lot of device I/O, you'll see the guest VCPU being 
blocked for 50ms, then getting to run for a very brief period of time, 
followed by another block for 50ms.  The guest's execution will be 
extremely sporadic.

This isn't fixable with this approach.  The only way to really fix this 
is to say that over a given period of time, migration may only consume 
XX amount of CPU time which guarantees the VCPUs get the qemu_mutex for 
the rest of the time.

This is exactly what rate limiting does.  Yes, it results in a longer 
migration time but that's the trade-off we have to make if we want 
deterministic VCPU execution until we can implement threading properly.

If you want a simple example, doing I/O with the rtl8139 adapter while 
doing your migration test and run a tight loop in the get running 
gettimeofday().  Graph the results to see how much execution time the 
guest is actually getting.

>> In the long term, we need a new dirty bit interface from kvm.ko that
>> uses a multi-level table. That should dramatically improve scan
>> performance. We also need to implement live migration in a separate
>> thread that doesn't carry qemu_mutex while it runs.
>
> This may be a good way to fix it, but it's also basically a rewrite.

The only correct short term solution I can see if rate limiting 
unfortunately.

Regards,

Anthony Liguori

> Paolo
> -- 
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 09/10] Exit loop if we have been there too long
  2010-11-30 15:00           ` Anthony Liguori
@ 2010-11-30 17:59             ` Juan Quintela
  0 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 17:59 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paolo Bonzini, qemu-devel, kvm-devel

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/30/2010 08:12 AM, Paolo Bonzini wrote:
>> On 11/30/2010 02:47 PM, Anthony Liguori wrote:
>>> On 11/30/2010 01:15 AM, Paolo Bonzini wrote:

> Juan's patch, IIUC, does the following: If you've been iterating in a
> tight loop, return to the main loop for *one* iteration every 50ms.

I dont' think it that way, more to the lines that: if you have spent
50ms, time to leave others to run O:-)  But yes, this is one of the changes.

> But this means that during this 50ms period of time, a VCPU may be
> blocked from running.  If the guest isn't doing a lot of device I/O
> *and* you're on a relatively low link speed, then this will mean that
> you don't hold qemu_mutex for more than 50ms at a time.
>
> But in the degenerate case where you have a high speed link and you
> have a guest doing a lot of device I/O, you'll see the guest VCPU
> being blocked for 50ms, then getting to run for a very brief period of
> time, followed by another block for 50ms.  The guest's execution will
> be extremely sporadic.

ok, lets go back a bit.  we have two problems:
- vcpu stalls
- main_loop stalls

my patch reduce stalls to a maximun of 50ms, just now they can be much
bigger.  Fully agree that doing ram_save_live without qemu_mutex is
an improvement.  But we are still hogging teh main loop, so we need my
patch and splitting qemu_mutex.

In other words, splitting qemu_mutex will fix half of the problem, not
all of it.

> This isn't fixable with this approach.  The only way to really fix
> this is to say that over a given period of time, migration may only
> consume XX amount of CPU time which guarantees the VCPUs get the
> qemu_mutex for the rest of the time.
>
> This is exactly what rate limiting does.  Yes, it results in a longer
> migration time but that's the trade-off we have to make if we want
> deterministic VCPU execution until we can implement threading
> properly.

I just give up here the discussion.  Basically I abused one thing (a
timeout), that was trivial to understand: if we have spent too much
time, leave others to run.

It is not a perfect solution, but it is a big improvement over what we
had.

I have been suggested that I use the other appoarch, and that I abuse
qemu_file_rate_limit().  creating a new qemu_file_write_nop() that only
increases xfer bytes transmited.  And then trust the rate limiting code
to fix the problem.  How this is nicer/clearer is completely alien to
me.  That probably also reduces the stalls, perhaps (I have to look at
the interactions), but it is a complete lie, we are counting as
trasfering stuff, stuff that we are not transfering :(

Why I don't like it?  Because what I found when I started is that this
rate_limit works very badly if we are not able to run the buffered_file
limit each 100ms.  If we lost ticks of that timer, our calculations are
wrong.

> If you want a simple example, doing I/O with the rtl8139 adapter while
> doing your migration test and run a tight loop in the get running
> gettimeofday().  Graph the results to see how much execution time the
> guest is actually getting.

I am still convinced that we need to limit the ammount of time that an
io_handler can use.

And that we should write scare things to the logs when an io_handler
needs too much time.

There are two independent problems.  I fully agree that having
ram_save_live() without qemu_mutex is an improvement.  But it is an
improvement independently of my patch.

>>> In the long term, we need a new dirty bit interface from kvm.ko that
>>> uses a multi-level table. That should dramatically improve scan
>>> performance. We also need to implement live migration in a separate
>>> thread that doesn't carry qemu_mutex while it runs.
>>
>> This may be a good way to fix it, but it's also basically a rewrite.
>
> The only correct short term solution I can see if rate limiting
> unfortunately.

I still think that this is an inferior solution, we are using something
to limit the amount of stuff that we write to the network to limit the
amount of pages that we walk.

Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> Paolo
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] [PATCH 10/10] Maintaing number of dirty pages
  2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
                   ` (8 preceding siblings ...)
  2010-11-23 23:03 ` [Qemu-devel] [PATCH 09/10] Exit loop if we have been there too long Juan Quintela
@ 2010-11-23 23:03 ` Juan Quintela
       [not found]   ` <4CF45DE0.8020701@codemonkey.ws>
  9 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-23 23:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela

From: Juan Quintela <quintela@trasno.org>

Calculate the number of dirty pages takes a lot on hosts with lots
of memory.  Just maintain how many pages are dirty.  Only sync bitmaps
if number is small enough.

Signed-off-by: Juan Quintela <quintela@trasno.org>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 arch_init.c |   15 +--------------
 cpu-all.h   |    7 +++++++
 exec.c      |    1 +
 3 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index b463798..c2bc144 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -176,20 +176,7 @@ static uint64_t bytes_transferred;

 static uint64_t ram_save_remaining(void)
 {
-    RAMBlock *block;
-    uint64_t count = 0;
-
-    QLIST_FOREACH(block, &ram_list.blocks, next) {
-        ram_addr_t addr;
-        for (addr = block->offset; addr < block->offset + block->length;
-             addr += TARGET_PAGE_SIZE) {
-            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
-                count++;
-            }
-        }
-    }
-
-    return count;
+    return ram_list.dirty_pages;
 }

 uint64_t ram_bytes_remaining(void)
diff --git a/cpu-all.h b/cpu-all.h
index 30ae17d..5dc27c6 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -874,6 +874,7 @@ typedef struct RAMBlock {

 typedef struct RAMList {
     uint8_t *phys_dirty;
+    uint64_t dirty_pages;
     QLIST_HEAD(ram, RAMBlock) blocks;
 } RAMList;
 extern RAMList ram_list;
@@ -922,6 +923,9 @@ static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,

 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
+    if (!cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
+        ram_list.dirty_pages++;
+
     ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
 }

@@ -942,6 +946,9 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
     mask = ~dirty_flags;
     p = ram_list.phys_dirty + (start >> TARGET_PAGE_BITS);
     for (i = 0; i < len; i++) {
+        if (cpu_physical_memory_get_dirty(start + i * TARGET_PAGE_SIZE,
+                                          MIGRATION_DIRTY_FLAG & dirty_flags))
+            ram_list.dirty_pages--;
         p[i] &= mask;
     }
 }
diff --git a/exec.c b/exec.c
index f5b2386..ca2506e 100644
--- a/exec.c
+++ b/exec.c
@@ -2866,6 +2866,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
                                        last_ram_offset() >> TARGET_PAGE_BITS);
     memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS),
            0xff, size >> TARGET_PAGE_BITS);
+    ram_list.dirty_pages += size >> TARGET_PAGE_BITS;

     if (kvm_enabled())
         kvm_setup_guest_memory(new_block->host, size);
-- 
1.7.3.2

^ permalink raw reply related	[flat|nested] 75+ messages in thread

[parent not found: <4CF45DE0.8020701@codemonkey.ws>]

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
       [not found]   ` <4CF45DE0.8020701@codemonkey.ws>
@ 2010-11-30 14:46     ` Juan Quintela
  2010-12-01 14:46       ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-11-30 14:46 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/23/2010 05:03 PM, Juan Quintela wrote:
>> From: Juan Quintela<quintela@trasno.org>
>>
>> Calculate the number of dirty pages takes a lot on hosts with lots
>> of memory.  Just maintain how many pages are dirty.  Only sync bitmaps
>> if number is small enough.
>>    
>
> There needs to be numbers that justify this as part of the commit message.

They are on patch 0/6.

Additionally, with 64GB of RAM, this bitmap is HUGE, having to walk over
it in each ram_save_live() call is too onerous.

> This only works for KVM because it happens to use set_dirty() whenever
> it dirties memory.  I don't think will work with TCG.

The TCG is already broken with respect to migration.  Nothing else
sets that bitmap nowadays.

> I also think that the fundamental problem is the kernel dirty bitmap.
> Perhaps it should return a multiple level table instead of a simple
> linear bitmap.

KVM interface is suboptimal, no doubt here.  But the bigger problem is
qemu implementation.  Having a byte arry where it only uses 2 bits is
wasteful at least.  And having to transform the array forth<->back with
kvm is worse still.

Longer term my plan is:
- make kvm return the number of dirty pages somehow (i.e. no need to
  calculate them)
- split bitmaps, migration bitmap only needs to be handled during
  migration, and CODE bitmap only for TCG.
- interface with kvm would better be something like array of pages
  or array (pages, count).  But I need to measure if there is any
  differences/improvements with one/another.

We were discussing yesterday what made ram_live_block() staying on top
of the profiles, it was this.  This makes ram_save_live() to dissapear
from profiles.

Later, Juan.

> Regards,
>
> Anthony Liguori
>
>> Signed-off-by: Juan Quintela<quintela@trasno.org>
>> Signed-off-by: Juan Quintela<quintela@redhat.com>
>> ---
>>   arch_init.c |   15 +--------------
>>   cpu-all.h   |    7 +++++++
>>   exec.c      |    1 +
>>   3 files changed, 9 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index b463798..c2bc144 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -176,20 +176,7 @@ static uint64_t bytes_transferred;
>>
>>   static uint64_t ram_save_remaining(void)
>>   {
>> -    RAMBlock *block;
>> -    uint64_t count = 0;
>> -
>> -    QLIST_FOREACH(block,&ram_list.blocks, next) {
>> -        ram_addr_t addr;
>> -        for (addr = block->offset; addr<  block->offset + block->length;
>> -             addr += TARGET_PAGE_SIZE) {
>> -            if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
>> -                count++;
>> -            }
>> -        }
>> -    }
>> -
>> -    return count;
>> +    return ram_list.dirty_pages;
>>   }
>>
>>   uint64_t ram_bytes_remaining(void)
>> diff --git a/cpu-all.h b/cpu-all.h
>> index 30ae17d..5dc27c6 100644
>> --- a/cpu-all.h
>> +++ b/cpu-all.h
>> @@ -874,6 +874,7 @@ typedef struct RAMBlock {
>>
>>   typedef struct RAMList {
>>       uint8_t *phys_dirty;
>> +    uint64_t dirty_pages;
>>       QLIST_HEAD(ram, RAMBlock) blocks;
>>   } RAMList;
>>   extern RAMList ram_list;
>> @@ -922,6 +923,9 @@ static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>>
>>   static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>>   {
>> +    if (!cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
>> +        ram_list.dirty_pages++;
>> +
>>       ram_list.phys_dirty[addr>>  TARGET_PAGE_BITS] = 0xff;
>>   }
>>
>> @@ -942,6 +946,9 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
>>       mask = ~dirty_flags;
>>       p = ram_list.phys_dirty + (start>>  TARGET_PAGE_BITS);
>>       for (i = 0; i<  len; i++) {
>> +        if (cpu_physical_memory_get_dirty(start + i * TARGET_PAGE_SIZE,
>> +                                          MIGRATION_DIRTY_FLAG&  dirty_flags))
>> +            ram_list.dirty_pages--;
>>           p[i]&= mask;
>>       }
>>   }
>> diff --git a/exec.c b/exec.c
>> index f5b2386..ca2506e 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -2866,6 +2866,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
>>                                          last_ram_offset()>>  TARGET_PAGE_BITS);
>>       memset(ram_list.phys_dirty + (new_block->offset>>  TARGET_PAGE_BITS),
>>              0xff, size>>  TARGET_PAGE_BITS);
>> +    ram_list.dirty_pages += size>>  TARGET_PAGE_BITS;
>>
>>       if (kvm_enabled())
>>           kvm_setup_guest_memory(new_block->host, size);
>>    

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-11-30 14:46     ` [Qemu-devel] " Juan Quintela
@ 2010-12-01 14:46       ` Avi Kivity
  2010-12-01 15:51         ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2010-12-01 14:46 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel

On 11/30/2010 04:46 PM, Juan Quintela wrote:
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
> >  On 11/23/2010 05:03 PM, Juan Quintela wrote:
> >>  From: Juan Quintela<quintela@trasno.org>
> >>
> >>  Calculate the number of dirty pages takes a lot on hosts with lots
> >>  of memory.  Just maintain how many pages are dirty.  Only sync bitmaps
> >>  if number is small enough.
> >>
> >
> >  There needs to be numbers that justify this as part of the commit message.
>
> They are on patch 0/6.
>
> Additionally, with 64GB of RAM, this bitmap is HUGE, having to walk over
> it in each ram_save_live() call is too onerous.

It's not so huge.  It's scaled down by a factor of 8 * 4096 = 32K.  So 
it's a 2MB bitmap.  If kept as a bitmap and accessed in longs, it can be 
read in less than a millisecond.

An optimization can be to look at the previous ram_save_live (which had 
to walk the bitmap).  If old_nr_dirty > 4*target_nr_dirty, assume we 
need one more pass and don't scan the bitmap.

Another optimization is to stop the count when we reach the target; 
instead of ram_save_remaining() have a ram_save_may_stop() which counts 
the number of dirty bits until it reaches target_nr_dirty or exhausts 
the bitmap.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 14:46       ` Avi Kivity
@ 2010-12-01 15:51         ` Juan Quintela
  2010-12-01 15:55           ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-12-01 15:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

Avi Kivity <avi@redhat.com> wrote:
> On 11/30/2010 04:46 PM, Juan Quintela wrote:
>> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>> >  On 11/23/2010 05:03 PM, Juan Quintela wrote:
>> >>  From: Juan Quintela<quintela@trasno.org>
>> >>
>> >>  Calculate the number of dirty pages takes a lot on hosts with lots
>> >>  of memory.  Just maintain how many pages are dirty.  Only sync bitmaps
>> >>  if number is small enough.
>> >>
>> >
>> >  There needs to be numbers that justify this as part of the commit message.
>>
>> They are on patch 0/6.
>>
>> Additionally, with 64GB of RAM, this bitmap is HUGE, having to walk over
>> it in each ram_save_live() call is too onerous.
>
> It's not so huge.  It's scaled down by a factor of 8 * 4096 = 32K.  So
> it's a 2MB bitmap.  If kept as a bitmap and accessed in longs, it can
> be read in less than a millisecond.

Going to undertand why we need the other bitmaps for kvm.

Basically we have:
- CODE bit: kvm don't care
- VGA bit: not really sure how much we care.  My understanding is that
  the bitmap for the VGA don't need to be as big as all memory.
- MIGRATION bit: we care O:-)

Now, we are setting the MIGRATION in three ways:
- kvm (we sync with that)
- vhost net: it uses a different mechanism, not shared with kvm
- *write[blw].  My understanding is that we care about this ones.  Do we
  really care?

We are also syncing too much bitmaps, we can do a bit less syncing.


> An optimization can be to look at the previous ram_save_live (which
> had to walk the bitmap).  If old_nr_dirty > 4*target_nr_dirty, assume
> we need one more pass and don't scan the bitmap.

We had a similar optimization on rhel5.  We only synched the bitmap each
time that we passed over address 0.  And each time that we called
ram_save_remaining().

Trivial optimization for ram_save_reamaining is to:
- maintain the number of pages that are dirty (we only modify the bitmap
 in two places, trivial to inc/dec the total number of dirty pages).
- on ram_save_live only sync bitmap if we are about to exit from the
  loop.  If pages dirty are still bigger that the one we need to go to
  non-save life, it makes no sense to sync.

> Another optimization is to stop the count when we reach the target;
> instead of ram_save_remaining() have a ram_save_may_stop() which
> counts the number of dirty bits until it reaches target_nr_dirty or
> exhausts the bitmap.

The problem here is that guest want to continue running, and continues
dirtying pages.  So, no obvious place where to stop counting.  Or I am
losing something?

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 15:51         ` Juan Quintela
@ 2010-12-01 15:55           ` Anthony Liguori
  2010-12-01 16:25             ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-12-01 15:55 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Avi Kivity, qemu-devel

On 12/01/2010 09:51 AM, Juan Quintela wrote:
> Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 11/30/2010 04:46 PM, Juan Quintela wrote:
>>      
>>> Anthony Liguori<anthony@codemonkey.ws>   wrote:
>>>        
>>>>   On 11/23/2010 05:03 PM, Juan Quintela wrote:
>>>>          
>>>>>   From: Juan Quintela<quintela@trasno.org>
>>>>>
>>>>>   Calculate the number of dirty pages takes a lot on hosts with lots
>>>>>   of memory.  Just maintain how many pages are dirty.  Only sync bitmaps
>>>>>   if number is small enough.
>>>>>
>>>>>            
>>>>   There needs to be numbers that justify this as part of the commit message.
>>>>          
>>> They are on patch 0/6.
>>>
>>> Additionally, with 64GB of RAM, this bitmap is HUGE, having to walk over
>>> it in each ram_save_live() call is too onerous.
>>>        
>> It's not so huge.  It's scaled down by a factor of 8 * 4096 = 32K.  So
>> it's a 2MB bitmap.  If kept as a bitmap and accessed in longs, it can
>> be read in less than a millisecond.

BTW, by this logic, even a 1-byte dirty bitmap is only 16mb which can be 
read in less than 16ms so where is the reported 24 minute stall coming from?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 15:55           ` Anthony Liguori
@ 2010-12-01 16:25             ` Juan Quintela
  2010-12-01 16:33               ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Juan Quintela @ 2010-12-01 16:25 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 12/01/2010 09:51 AM, Juan Quintela wrote:
>> Avi Kivity<avi@redhat.com>  wrote:
>>    
>>> On 11/30/2010 04:46 PM, Juan Quintela wrote:
>>>      
>>>> Anthony Liguori<anthony@codemonkey.ws>   wrote:
>>>>        
>>>>>   On 11/23/2010 05:03 PM, Juan Quintela wrote:
>>>>>          
>>>>>>   From: Juan Quintela<quintela@trasno.org>
>>>>>>
>>>>>>   Calculate the number of dirty pages takes a lot on hosts with lots
>>>>>>   of memory.  Just maintain how many pages are dirty.  Only sync bitmaps
>>>>>>   if number is small enough.
>>>>>>
>>>>>>            
>>>>>   There needs to be numbers that justify this as part of the commit message.
>>>>>          
>>>> They are on patch 0/6.
>>>>
>>>> Additionally, with 64GB of RAM, this bitmap is HUGE, having to walk over
>>>> it in each ram_save_live() call is too onerous.
>>>>        
>>> It's not so huge.  It's scaled down by a factor of 8 * 4096 = 32K.  So
>>> it's a 2MB bitmap.  If kept as a bitmap and accessed in longs, it can
>>> be read in less than a millisecond.
>
> BTW, by this logic, even a 1-byte dirty bitmap is only 16mb which can
> be read in less than 16ms so where is the reported 24 minute stall
> coming from?

a) we read the bitmap more than once
b) the 1ms is based on "we read" it with longs and optimized, just now
   we have to read it by byte and inside the byte.

Later, Juan.

>
> Regards,
>
> Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 16:25             ` Juan Quintela
@ 2010-12-01 16:33               ` Anthony Liguori
  2010-12-01 16:43                 ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-12-01 16:33 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Avi Kivity, qemu-devel

On 12/01/2010 10:25 AM, Juan Quintela wrote:
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 12/01/2010 09:51 AM, Juan Quintela wrote:
>>      
>>> Avi Kivity<avi@redhat.com>   wrote:
>>>
>>>        
>>>> On 11/30/2010 04:46 PM, Juan Quintela wrote:
>>>>
>>>>          
>>>>> Anthony Liguori<anthony@codemonkey.ws>    wrote:
>>>>>
>>>>>            
>>>>>>    On 11/23/2010 05:03 PM, Juan Quintela wrote:
>>>>>>
>>>>>>              
>>>>>>>    From: Juan Quintela<quintela@trasno.org>
>>>>>>>
>>>>>>>    Calculate the number of dirty pages takes a lot on hosts with lots
>>>>>>>    of memory.  Just maintain how many pages are dirty.  Only sync bitmaps
>>>>>>>    if number is small enough.
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>    There needs to be numbers that justify this as part of the commit message.
>>>>>>
>>>>>>              
>>>>> They are on patch 0/6.
>>>>>
>>>>> Additionally, with 64GB of RAM, this bitmap is HUGE, having to walk over
>>>>> it in each ram_save_live() call is too onerous.
>>>>>
>>>>>            
>>>> It's not so huge.  It's scaled down by a factor of 8 * 4096 = 32K.  So
>>>> it's a 2MB bitmap.  If kept as a bitmap and accessed in longs, it can
>>>> be read in less than a millisecond.
>>>>          
>> BTW, by this logic, even a 1-byte dirty bitmap is only 16mb which can
>> be read in less than 16ms so where is the reported 24 minute stall
>> coming from?
>>      
> a) we read the bitmap more than once
>    

Not in a single iteration which is what the "stall" would consist of.

> b) the 1ms is based on "we read" it with longs and optimized, just now
>     we have to read it by byte and inside the byte.
>    

Byte accesses verse long access doesn't turn 16ms into 24 minutes.

Regards,

Anthony Liguori

> Later, Juan.
>
>    
>> Regards,
>>
>> Anthony Liguori
>>      

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 16:33               ` Anthony Liguori
@ 2010-12-01 16:43                 ` Avi Kivity
  2010-12-01 16:49                   ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2010-12-01 16:43 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Juan Quintela

On 12/01/2010 06:33 PM, Anthony Liguori wrote:
>>> BTW, by this logic, even a 1-byte dirty bitmap is only 16mb which can
>>> be read in less than 16ms so where is the reported 24 minute stall
>>> coming from?
>> a) we read the bitmap more than once
>
> Not in a single iteration which is what the "stall" would consist of.
>
>> b) the 1ms is based on "we read" it with longs and optimized, just now
>>     we have to read it by byte and inside the byte.
>
> Byte accesses verse long access doesn't turn 16ms into 24 minutes.

We need actual measurements instead of speculations.  Walking the dirty 
bitmap _did_ _not_ introduce any stalls.  It was the qemu mutex, or 
walking kvm data structures in the kernel, or something.  No amount of 
dirty bitmap optimization will fix it.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 16:43                 ` Avi Kivity
@ 2010-12-01 16:49                   ` Anthony Liguori
  2010-12-01 16:52                     ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-12-01 16:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Juan Quintela

On 12/01/2010 10:43 AM, Avi Kivity wrote:
> On 12/01/2010 06:33 PM, Anthony Liguori wrote:
>>>> BTW, by this logic, even a 1-byte dirty bitmap is only 16mb which can
>>>> be read in less than 16ms so where is the reported 24 minute stall
>>>> coming from?
>>> a) we read the bitmap more than once
>>
>> Not in a single iteration which is what the "stall" would consist of.
>>
>>> b) the 1ms is based on "we read" it with longs and optimized, just now
>>>     we have to read it by byte and inside the byte.
>>
>> Byte accesses verse long access doesn't turn 16ms into 24 minutes.
>
> We need actual measurements instead of speculations.

Yes, I agree 100%.  I think the place to start is what I suggested in a 
previous note in this thread, we need to measure actual stall time in 
the guest.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 16:49                   ` Anthony Liguori
@ 2010-12-01 16:52                     ` Avi Kivity
  2010-12-01 16:56                       ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2010-12-01 16:52 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Juan Quintela

On 12/01/2010 06:49 PM, Anthony Liguori wrote:
>> We need actual measurements instead of speculations.
>
>
> Yes, I agree 100%.  I think the place to start is what I suggested in 
> a previous note in this thread, we need to measure actual stall time 
> in the guest.

I'd actually start at the host.  How much time does 
ioctl(KVM_GET_DIRTY_LOG) take?  What's the percentage of time qemu_mutex 
is held?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 16:52                     ` Avi Kivity
@ 2010-12-01 16:56                       ` Anthony Liguori
  2010-12-01 17:01                         ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-12-01 16:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Juan Quintela

On 12/01/2010 10:52 AM, Avi Kivity wrote:
> On 12/01/2010 06:49 PM, Anthony Liguori wrote:
>>> We need actual measurements instead of speculations.
>>
>>
>> Yes, I agree 100%.  I think the place to start is what I suggested in 
>> a previous note in this thread, we need to measure actual stall time 
>> in the guest.
>
> I'd actually start at the host.  How much time does 
> ioctl(KVM_GET_DIRTY_LOG) take?  What's the percentage of time 
> qemu_mutex is held?

The question is, what really are the symptoms of the problem.  It's not 
necessarily a bad thing if KVM_GET_DIRTY_LOG takes a long while 
qemu_mutex is held.

Is the problem that the monitor responds slowly?  Is the problem that 
the guest isn't consistently getting execution time?  Is the proper 
simply that the guest isn't getting enough total execution time?

I think we need to identify exactly what the problem is before we look 
for sources.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 16:56                       ` Anthony Liguori
@ 2010-12-01 17:01                         ` Avi Kivity
  2010-12-01 17:05                           ` Anthony Liguori
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2010-12-01 17:01 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Juan Quintela

On 12/01/2010 06:56 PM, Anthony Liguori wrote:
> On 12/01/2010 10:52 AM, Avi Kivity wrote:
>> On 12/01/2010 06:49 PM, Anthony Liguori wrote:
>>>> We need actual measurements instead of speculations.
>>>
>>>
>>> Yes, I agree 100%.  I think the place to start is what I suggested 
>>> in a previous note in this thread, we need to measure actual stall 
>>> time in the guest.
>>
>> I'd actually start at the host.  How much time does 
>> ioctl(KVM_GET_DIRTY_LOG) take?  What's the percentage of time 
>> qemu_mutex is held?
>
> The question is, what really are the symptoms of the problem.  It's 
> not necessarily a bad thing if KVM_GET_DIRTY_LOG takes a long while 
> qemu_mutex is held.

Whether or not qemu_mutex is held, long KVM_GET_DIRTY_LONG runtimes are 
bad, since they are a lower bound on your downtime.  And 
KVM_GET_DIRTY_LOG does a lot of work, and invokes 
synchronize_srcu_expedited(), which can be very slow.

>
> Is the problem that the monitor responds slowly?  Is the problem that 
> the guest isn't consistently getting execution time?  Is the proper 
> simply that the guest isn't getting enough total execution time?

All three can happen if qemu_mutex is held too long.  The third can 
happen for other reasons (like KVM_GET_DIRTY_LOG holding the mmu 
spinlock for too long, can fix with O(1) write protection).

>
> I think we need to identify exactly what the problem is before we look 
> for sources.
>


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 17:01                         ` Avi Kivity
@ 2010-12-01 17:05                           ` Anthony Liguori
  2010-12-01 18:51                             ` Juan Quintela
  0 siblings, 1 reply; 75+ messages in thread
From: Anthony Liguori @ 2010-12-01 17:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, Juan Quintela

On 12/01/2010 11:01 AM, Avi Kivity wrote:
> On 12/01/2010 06:56 PM, Anthony Liguori wrote:
>> On 12/01/2010 10:52 AM, Avi Kivity wrote:
>>> On 12/01/2010 06:49 PM, Anthony Liguori wrote:
>>>>> We need actual measurements instead of speculations.
>>>>
>>>>
>>>> Yes, I agree 100%.  I think the place to start is what I suggested 
>>>> in a previous note in this thread, we need to measure actual stall 
>>>> time in the guest.
>>>
>>> I'd actually start at the host.  How much time does 
>>> ioctl(KVM_GET_DIRTY_LOG) take?  What's the percentage of time 
>>> qemu_mutex is held?
>>
>> The question is, what really are the symptoms of the problem.  It's 
>> not necessarily a bad thing if KVM_GET_DIRTY_LOG takes a long while 
>> qemu_mutex is held.
>
> Whether or not qemu_mutex is held, long KVM_GET_DIRTY_LONG runtimes 
> are bad, since they are a lower bound on your downtime.  And 
> KVM_GET_DIRTY_LOG does a lot of work, and invokes 
> synchronize_srcu_expedited(), which can be very slow.

That's fine, and you're right, it's a useful thing to do, but this 
series originated because of a problem and we ought to make sure we 
capture what the actual problem is.  That's not to say we shouldn't 
improve things that could stand to be improved.

>>
>> Is the problem that the monitor responds slowly?  Is the problem that 
>> the guest isn't consistently getting execution time?  Is the proper 
>> simply that the guest isn't getting enough total execution time?
>
> All three can happen if qemu_mutex is held too long.

Right, but I'm starting to think that the root of the problem is not 
that it's being held too long but that it's being held too often.

Regards,

Anthony Liguori

>   The third can happen for other reasons (like KVM_GET_DIRTY_LOG 
> holding the mmu spinlock for too long, can fix with O(1) write 
> protection).
>
>>
>> I think we need to identify exactly what the problem is before we 
>> look for sources.
>>
>
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Qemu-devel] Re: [PATCH 10/10] Maintaing number of dirty pages
  2010-12-01 17:05                           ` Anthony Liguori
@ 2010-12-01 18:51                             ` Juan Quintela
  0 siblings, 0 replies; 75+ messages in thread
From: Juan Quintela @ 2010-12-01 18:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 12/01/2010 11:01 AM, Avi Kivity wrote:
>> On 12/01/2010 06:56 PM, Anthony Liguori wrote:
>>> On 12/01/2010 10:52 AM, Avi Kivity wrote:
>>>> On 12/01/2010 06:49 PM, Anthony Liguori wrote:
>>>>>> We need actual measurements instead of speculations.
>>>>>
>>>>>
>>>>> Yes, I agree 100%.  I think the place to start is what I
>>>>> suggested in a previous note in this thread, we need to measure
>>>>> actual stall time in the guest.
>>>>
>>>> I'd actually start at the host.  How much time does
>>>> ioctl(KVM_GET_DIRTY_LOG) take?  What's the percentage of time
>>>> qemu_mutex is held?
>>>
>>> The question is, what really are the symptoms of the problem.  It's
>>> not necessarily a bad thing if KVM_GET_DIRTY_LOG takes a long while
>>> qemu_mutex is held.
>>
>> Whether or not qemu_mutex is held, long KVM_GET_DIRTY_LONG runtimes
>> are bad, since they are a lower bound on your downtime.  And
>> KVM_GET_DIRTY_LOG does a lot of work, and invokes
>> synchronize_srcu_expedited(), which can be very slow.
>
> That's fine, and you're right, it's a useful thing to do, but this
> series originated because of a problem and we ought to make sure we
> capture what the actual problem is.  That's not to say we shouldn't
> improve things that could stand to be improved.
>
>>>
>>> Is the problem that the monitor responds slowly?  Is the problem
>>> that the guest isn't consistently getting execution time?  Is the
>>> proper simply that the guest isn't getting enough total execution
>>> time?
>>
>> All three can happen if qemu_mutex is held too long.
>
> Right, but I'm starting to think that the root of the problem is not
> that it's being held too long but that it's being held too often.

Ok, I tested yesterday dropping qemu_mutex on ram_save_block (crude
thing, just qemu_mutex_unlock_iothread(); loop ;
qemu_mutex_lock_iothread();

As requested by Anthony, I tested on the guest to see how big stalls
were.  Code is:

        while (1) {
                if (gettimeofday(&t0, NULL) != 0)
                        perror("gettimeofday 1");
                if (usleep(100) != 0)
                        perror("usleep");
                if (gettimeofday(&t1, NULL) != 0)
                        perror("gettimeofday 2");
                t1.tv_usec -= t0.tv_usec;
                if (t1.tv_usec < 0) {
                        t1.tv_usec += 1000000;
                        t1.tv_sec--;
                }
                t1.tv_sec -= t0.tv_sec;

                if (t1.tv_sec || t1.tv_usec > 5000)
                        printf("delay of %ld\n", t1.tv_sec * 1000000 + t1.tv_usec);
        }

I tried in a guest that is completely idle with 8vcpus.  on idle, only
some stalls in the 5-8ms happens (as expected).

(this is after my series).

As soon as I start migration, we got several stalls in the 15-200ms
range.  Notice that stalls are not bigger because I limit the time that
qemu_mutex is held on the iothread to 50ms each time.

doing the crude qemu_mutex drop on ram_save_live, means that this
ministalls got way smaller in the 10-15ms range (some rare at 20ms).

And then we have an stall of around 120ms during the non-live part of
the migration.  I can't find where this stall comes from (i.e. saving
all the rest of pages and normal sections take much less time).  But on
the other hand, I have no instrumentation yet to measure how long it
takes to move to the other host and restart there.

So, we are still not there, but now we have only a single 120ms stall on
the guest, versus the 1-4 seconds ones that we used to have.

I don't have access to this machines until next week, so I am spending
this week implementing the ideas given on this thread.

Later, Juan.

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2010-12-02  8:38 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
2010-11-23 23:02 ` [Qemu-devel] [PATCH 01/10] Add spent time to migration Juan Quintela
2010-11-23 23:02 ` [Qemu-devel] [PATCH 02/10] Add buffered_file_internal constant Juan Quintela
2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
2010-11-24 10:52     ` Juan Quintela
2010-11-24 11:04       ` Michael S. Tsirkin
2010-11-24 11:13         ` Juan Quintela
2010-11-24 11:19           ` Michael S. Tsirkin
     [not found]       ` <4CF46012.2060804@codemonkey.ws>
2010-11-30 11:56         ` Juan Quintela
2010-11-30 14:02           ` Anthony Liguori
2010-11-30 14:11             ` Michael S. Tsirkin
2010-11-30 14:22               ` Anthony Liguori
2010-11-30 15:40             ` Juan Quintela
2010-11-30 16:10               ` Michael S. Tsirkin
2010-11-30 16:32                 ` Juan Quintela
2010-11-30 16:44                   ` Anthony Liguori
2010-11-30 18:04                     ` Juan Quintela
2010-11-30 18:54                       ` Anthony Liguori
2010-11-30 19:15                         ` Juan Quintela
2010-11-30 20:23                           ` Anthony Liguori
2010-11-30 20:56                             ` Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 03/10] Add printf debug to savevm Juan Quintela
     [not found]   ` <4CF45AB2.7050506@codemonkey.ws>
2010-11-30 10:36     ` Stefan Hajnoczi
2010-11-30 22:40       ` [Qemu-devel] " Juan Quintela
2010-12-01  7:50         ` Stefan Hajnoczi
2010-11-23 23:03 ` [Qemu-devel] [PATCH 04/10] No need to iterate if we already are over the limit Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 05/10] KVM don't care about TLB handling Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 06/10] Only calculate expected_time for stage 2 Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 07/10] ram_save_remaining() returns an uint64_t Juan Quintela
     [not found]   ` <4CF45C0C.705@codemonkey.ws>
2010-11-30  7:21     ` [Qemu-devel] " Paolo Bonzini
2010-11-30 13:44       ` Anthony Liguori
2010-11-30 14:38     ` Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 08/10] Count nanoseconds with uint64_t not doubles Juan Quintela
2010-11-30  7:17   ` [Qemu-devel] " Paolo Bonzini
     [not found]   ` <4CF45C5B.9080507@codemonkey.ws>
2010-11-30 14:40     ` Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 09/10] Exit loop if we have been there too long Juan Quintela
2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
2010-11-24 11:01     ` Juan Quintela
2010-11-24 11:14       ` Michael S. Tsirkin
2010-11-24 15:16         ` Paolo Bonzini
2010-11-24 15:59           ` Michael S. Tsirkin
     [not found]           ` <4CF45E3F.4040609@codemonkey.ws>
2010-11-30  8:10             ` Paolo Bonzini
2010-11-30 13:26             ` Juan Quintela
     [not found]   ` <4CF45D67.5010906@codemonkey.ws>
2010-11-30  7:15     ` Paolo Bonzini
2010-11-30 13:47       ` Anthony Liguori
2010-11-30 13:58         ` Avi Kivity
2010-11-30 14:17           ` Anthony Liguori
2010-11-30 14:27             ` Avi Kivity
2010-11-30 14:50               ` Anthony Liguori
2010-12-01 12:40                 ` Avi Kivity
2010-11-30 17:43               ` Juan Quintela
2010-12-01  1:20               ` Takuya Yoshikawa
2010-12-01  1:52                 ` Juan Quintela
2010-12-01  2:22                   ` Takuya Yoshikawa
2010-12-01 12:35                   ` Avi Kivity
2010-12-01 13:45                     ` Juan Quintela
2010-12-02  1:31                     ` Takuya Yoshikawa
2010-12-02  8:37                       ` Avi Kivity
2010-11-30 14:12         ` Paolo Bonzini
2010-11-30 15:00           ` Anthony Liguori
2010-11-30 17:59             ` Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 10/10] Maintaing number of dirty pages Juan Quintela
     [not found]   ` <4CF45DE0.8020701@codemonkey.ws>
2010-11-30 14:46     ` [Qemu-devel] " Juan Quintela
2010-12-01 14:46       ` Avi Kivity
2010-12-01 15:51         ` Juan Quintela
2010-12-01 15:55           ` Anthony Liguori
2010-12-01 16:25             ` Juan Quintela
2010-12-01 16:33               ` Anthony Liguori
2010-12-01 16:43                 ` Avi Kivity
2010-12-01 16:49                   ` Anthony Liguori
2010-12-01 16:52                     ` Avi Kivity
2010-12-01 16:56                       ` Anthony Liguori
2010-12-01 17:01                         ` Avi Kivity
2010-12-01 17:05                           ` Anthony Liguori
2010-12-01 18:51                             ` Juan Quintela

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).