qemu-devel.nongnu.org archive mirror
* [PATCH v2 0/6] colo: migration related bugfixes
@ 2020-05-20 20:42 Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Lukas Straub @ 2020-05-20 20:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

Hello Everyone,
Here are fixes for bugs that I found in my tests.

Regards,
Lukas Straub

v2:
 -Don't touch qemu_file_rate_limit, solve the problem in migration_rate_limit

Lukas Straub (6):
  migration/colo.c: Use event instead of semaphore
  migration/colo.c: Use cpu_synchronize_all_states()
  migration/colo.c: Flush ram cache only after receiving device state
  migration/colo.c: Relaunch failover even if there was an error
  migration/colo.c: Move colo_notify_compares_event to the right place
  migration/migration.c: Fix hang in ram_save_host_page

 migration/colo.c      | 39 ++++++++++++++++++++++++---------------
 migration/migration.c |  4 ++++
 migration/migration.h |  4 ++--
 migration/ram.c       |  5 +----
 migration/ram.h       |  1 +
 5 files changed, 32 insertions(+), 21 deletions(-)

--
2.20.1


* [PATCH v2 1/6] migration/colo.c: Use event instead of semaphore
  2020-05-20 20:42 [PATCH v2 0/6] colo: migration related bugfixes Lukas Straub
@ 2020-05-20 20:42 ` Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 2/6] migration/colo.c: Use cpu_synchronize_all_states() Lukas Straub
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Lukas Straub @ 2020-05-20 20:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If multiple packets miscompare in a short timeframe, the semaphore
value will be increased multiple times. This causes multiple
checkpoints even if one would be sufficient.

Fix this by using an event instead of a semaphore for triggering
checkpoints. Now, checkpoint requests will be ignored until the
checkpoint event is sent to colo-compare (which releases the
miscompared packets).

Benchmark results (iperf3):
Client-to-server tcp:
without patch: ~66 Mbit/s
with patch: ~61 Mbit/s
Server-to-client tcp:
without patch: ~702 Kbit/s
with patch: ~16 Mbit/s

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
---
 migration/colo.c      | 9 +++++----
 migration/migration.h | 4 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index a54ac84f41..09168627bc 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -430,6 +430,7 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }

+    qemu_event_reset(&s->colo_checkpoint_event);
     colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
     if (local_err) {
         goto out;
@@ -580,7 +581,7 @@ static void colo_process_checkpoint(MigrationState *s)
             goto out;
         }

-        qemu_sem_wait(&s->colo_checkpoint_sem);
+        qemu_event_wait(&s->colo_checkpoint_event);

         if (s->state != MIGRATION_STATUS_COLO) {
             goto out;
@@ -628,7 +629,7 @@ out:
     colo_compare_unregister_notifier(&packets_compare_notifier);
     timer_del(s->colo_delay_timer);
     timer_free(s->colo_delay_timer);
-    qemu_sem_destroy(&s->colo_checkpoint_sem);
+    qemu_event_destroy(&s->colo_checkpoint_event);

     /*
      * Must be called after failover BH is completed,
@@ -645,7 +646,7 @@ void colo_checkpoint_notify(void *opaque)
     MigrationState *s = opaque;
     int64_t next_notify_time;

-    qemu_sem_post(&s->colo_checkpoint_sem);
+    qemu_event_set(&s->colo_checkpoint_event);
     s->colo_checkpoint_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
     next_notify_time = s->colo_checkpoint_time +
                     s->parameters.x_checkpoint_delay;
@@ -655,7 +656,7 @@ void colo_checkpoint_notify(void *opaque)
 void migrate_start_colo_process(MigrationState *s)
 {
     qemu_mutex_unlock_iothread();
-    qemu_sem_init(&s->colo_checkpoint_sem, 0);
+    qemu_event_init(&s->colo_checkpoint_event, false);
     s->colo_delay_timer =  timer_new_ms(QEMU_CLOCK_HOST,
                                 colo_checkpoint_notify, s);

diff --git a/migration/migration.h b/migration/migration.h
index 507284e563..f617960522 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -215,8 +215,8 @@ struct MigrationState
     /* The semaphore is used to notify COLO thread that failover is finished */
     QemuSemaphore colo_exit_sem;

-    /* The semaphore is used to notify COLO thread to do checkpoint */
-    QemuSemaphore colo_checkpoint_sem;
+    /* The event is used to notify COLO thread to do checkpoint */
+    QemuEvent colo_checkpoint_event;
     int64_t colo_checkpoint_time;
     QEMUTimer *colo_delay_timer;

--
2.20.1


* [PATCH v2 2/6] migration/colo.c: Use cpu_synchronize_all_states()
  2020-05-20 20:42 [PATCH v2 0/6] colo: migration related bugfixes Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
@ 2020-05-20 20:42 ` Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 3/6] migration/colo.c: Flush ram cache only after receiving device state Lukas Straub
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Lukas Straub @ 2020-05-20 20:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

cpu_synchronize_all_pre_loadvm() marks all vcpus as dirty, so the
registers are loaded from CPUState before we continue running
the vm. However, if we fail over during a checkpoint, CPUState is
not initialized and the registers are loaded with garbage. This
causes guest hangs and crashes.

Fix this by using cpu_synchronize_all_states(), which initializes
CPUState from the current cpu registers in addition to marking the
vcpus as dirty.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/colo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/colo.c b/migration/colo.c
index 09168627bc..6b2ad35aa4 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -696,7 +696,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     }

     qemu_mutex_lock_iothread();
-    cpu_synchronize_all_pre_loadvm();
+    cpu_synchronize_all_states();
     ret = qemu_loadvm_state_main(mis->from_src_file, mis);
     qemu_mutex_unlock_iothread();

--
2.20.1


* [PATCH v2 3/6] migration/colo.c: Flush ram cache only after receiving device state
  2020-05-20 20:42 [PATCH v2 0/6] colo: migration related bugfixes Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 1/6] migration/colo.c: Use event instead of semaphore Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 2/6] migration/colo.c: Use cpu_synchronize_all_states() Lukas Straub
@ 2020-05-20 20:42 ` Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 4/6] migration/colo.c: Relaunch failover even if there was an error Lukas Straub
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Lukas Straub @ 2020-05-20 20:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If we succeed in receiving the ram state but fail to receive the
device state, there will be a mismatch between the two.

Fix this by flushing the ram cache only after the vmstate has been
received.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
---
 migration/colo.c | 1 +
 migration/ram.c  | 5 +----
 migration/ram.h  | 1 +
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index 6b2ad35aa4..2947363ae5 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -739,6 +739,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,

     qemu_mutex_lock_iothread();
     vmstate_loading = true;
+    colo_flush_ram_cache();
     ret = qemu_load_device_state(fb);
     if (ret < 0) {
         error_setg(errp, "COLO: load device state failed");
diff --git a/migration/ram.c b/migration/ram.c
index 04f13feb2e..5baec5fce9 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3313,7 +3313,7 @@ static bool postcopy_is_running(void)
  * Flush content of RAM cache into SVM's memory.
  * Only flush the pages that be dirtied by PVM or SVM or both.
  */
-static void colo_flush_ram_cache(void)
+void colo_flush_ram_cache(void)
 {
     RAMBlock *block = NULL;
     void *dst_host;
@@ -3585,9 +3585,6 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     }
     trace_ram_load_complete(ret, seq_iter);

-    if (!ret  && migration_incoming_in_colo_state()) {
-        colo_flush_ram_cache();
-    }
     return ret;
 }

diff --git a/migration/ram.h b/migration/ram.h
index 5ceaff7cb4..2eeaacfa13 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -65,6 +65,7 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);

 /* ram cache */
 int colo_init_ram_cache(void);
+void colo_flush_ram_cache(void);
 void colo_release_ram_cache(void);
 void colo_incoming_start_dirty_log(void);

--
2.20.1


* [PATCH v2 4/6] migration/colo.c: Relaunch failover even if there was an error
  2020-05-20 20:42 [PATCH v2 0/6] colo: migration related bugfixes Lukas Straub
                   ` (2 preceding siblings ...)
  2020-05-20 20:42 ` [PATCH v2 3/6] migration/colo.c: Flush ram cache only after receiving device state Lukas Straub
@ 2020-05-20 20:42 ` Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 5/6] migration/colo.c: Move colo_notify_compares_event to the right place Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 6/6] migration/migration.c: Fix hang in ram_save_host_page Lukas Straub
  5 siblings, 0 replies; 8+ messages in thread
From: Lukas Straub @ 2020-05-20 20:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If vmstate_loading is true, secondary_vm_do_failover will set the
failover status to FAILOVER_STATUS_RELAUNCH and return success
without initiating failover. However, if an error occurs during the
vmstate_loading section, failover isn't relaunched; instead, we then
wait for failover on colo_incoming_sem.

Fix this by relaunching failover even if there was an error. Also,
to make this work properly, set vmstate_loading to false when
returning during the vmstate_loading section.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
---
 migration/colo.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index 2947363ae5..a69782efc5 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -743,6 +743,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     ret = qemu_load_device_state(fb);
     if (ret < 0) {
         error_setg(errp, "COLO: load device state failed");
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -751,6 +752,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     replication_get_error_all(&local_err);
     if (local_err) {
         error_propagate(errp, local_err);
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -759,6 +761,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     replication_do_checkpoint_all(&local_err);
     if (local_err) {
         error_propagate(errp, local_err);
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -770,6 +773,7 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,

     if (local_err) {
         error_propagate(errp, local_err);
+        vmstate_loading = false;
         qemu_mutex_unlock_iothread();
         return;
     }
@@ -780,9 +784,6 @@ static void colo_incoming_process_checkpoint(MigrationIncomingState *mis,
     qemu_mutex_unlock_iothread();

     if (failover_get_state() == FAILOVER_STATUS_RELAUNCH) {
-        failover_set_state(FAILOVER_STATUS_RELAUNCH,
-                        FAILOVER_STATUS_NONE);
-        failover_request_active(NULL);
         return;
     }

@@ -881,6 +882,14 @@ void *colo_process_incoming_thread(void *opaque)
             error_report_err(local_err);
             break;
         }
+
+        if (failover_get_state() == FAILOVER_STATUS_RELAUNCH) {
+            failover_set_state(FAILOVER_STATUS_RELAUNCH,
+                            FAILOVER_STATUS_NONE);
+            failover_request_active(NULL);
+            break;
+        }
+
         if (failover_get_state() != FAILOVER_STATUS_NONE) {
             error_report("failover request");
             break;
@@ -888,8 +897,6 @@ void *colo_process_incoming_thread(void *opaque)
     }

 out:
-    vmstate_loading = false;
-
     /*
      * There are only two reasons we can get here, some error happened
      * or the user triggered failover.
--
2.20.1


* [PATCH v2 5/6] migration/colo.c: Move colo_notify_compares_event to the right place
  2020-05-20 20:42 [PATCH v2 0/6] colo: migration related bugfixes Lukas Straub
                   ` (3 preceding siblings ...)
  2020-05-20 20:42 ` [PATCH v2 4/6] migration/colo.c: Relaunch failover even if there was an error Lukas Straub
@ 2020-05-20 20:42 ` Lukas Straub
  2020-05-20 20:42 ` [PATCH v2 6/6] migration/migration.c: Fix hang in ram_save_host_page Lukas Straub
  5 siblings, 0 replies; 8+ messages in thread
From: Lukas Straub @ 2020-05-20 20:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

If the secondary has to fail over during checkpointing, it is still
in the old state (i.e. a different state than the primary). Thus we
can't expose the primary's state until after the checkpoint is sent.

This fixes sporadic connection reset of client connections during
failover.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
---
 migration/colo.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index a69782efc5..a3fc21e86e 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -430,12 +430,6 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }

-    qemu_event_reset(&s->colo_checkpoint_event);
-    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
-    if (local_err) {
-        goto out;
-    }
-
     /* Disable block migration */
     migrate_set_block_enabled(false, &local_err);
     qemu_mutex_lock_iothread();
@@ -494,6 +488,12 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }

+    qemu_event_reset(&s->colo_checkpoint_event);
+    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
     colo_receive_check_message(s->rp_state.from_dst_file,
                        COLO_MESSAGE_VMSTATE_LOADED, &local_err);
     if (local_err) {
--
2.20.1


* [PATCH v2 6/6] migration/migration.c: Fix hang in ram_save_host_page
  2020-05-20 20:42 [PATCH v2 0/6] colo: migration related bugfixes Lukas Straub
                   ` (4 preceding siblings ...)
  2020-05-20 20:42 ` [PATCH v2 5/6] migration/colo.c: Move colo_notify_compares_event to the right place Lukas Straub
@ 2020-05-20 20:42 ` Lukas Straub
  2020-06-01 16:54   ` Dr. David Alan Gilbert
  5 siblings, 1 reply; 8+ messages in thread
From: Lukas Straub @ 2020-05-20 20:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: Hailiang Zhang, Dr. David Alan Gilbert, Juan Quintela

migration_rate_limit will erroneously rate-limit a socket that has
been shut down, which causes the migration thread to hang in
ram_save_host_page.

Fix this by explicitly testing in migration_rate_limit whether the
socket has errors or was shut down.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/migration.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 187ac0410c..e8bd32d48c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3347,6 +3347,10 @@ bool migration_rate_limit(void)
     bool urgent = false;
     migration_update_counters(s, now);
     if (qemu_file_rate_limit(s->to_dst_file)) {
+
+        if (qemu_file_get_error(s->to_dst_file)) {
+            return false;
+        }
         /*
          * Wait for a delay to do rate limiting OR
          * something urgent to post the semaphore.
--
2.20.1

* Re: [PATCH v2 6/6] migration/migration.c: Fix hang in ram_save_host_page
  2020-05-20 20:42 ` [PATCH v2 6/6] migration/migration.c: Fix hang in ram_save_host_page Lukas Straub
@ 2020-06-01 16:54   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2020-06-01 16:54 UTC (permalink / raw)
  To: Lukas Straub; +Cc: Juan Quintela, qemu-devel, Hailiang Zhang

* Lukas Straub (lukasstraub2@web.de) wrote:
> migration_rate_limit will erroneously ratelimit a shutdown socket,
> which causes the migration thread to hang in ram_save_host_page
> if the socket is shutdown.
> 
> Fix this by explicitly testing if the socket has errors or was
> shutdown in migration_rate_limit.
> 
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

and queued this one (the others in the series are the same as the v1)


> ---
>  migration/migration.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 187ac0410c..e8bd32d48c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3347,6 +3347,10 @@ bool migration_rate_limit(void)
>      bool urgent = false;
>      migration_update_counters(s, now);
>      if (qemu_file_rate_limit(s->to_dst_file)) {
> +
> +        if (qemu_file_get_error(s->to_dst_file)) {
> +            return false;
> +        }
>          /*
>           * Wait for a delay to do rate limiting OR
>           * something urgent to post the semaphore.
> --
> 2.20.1


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


