* [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
@ 2013-10-15  7:26 Jules Wang
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 1/4] Curling: add doc Jules Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Jules Wang @ 2013-10-15  7:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, Jules Wang, owasserm, quintela

v2 -> v3:
* add documentation of new option in qapi-schema.

* long option name: ft -> fault-tolerant

v1 -> v2:
* cmdline: migrate curling:tcp:<address>:<port> 
       ->  migrate -f tcp:<address>:<port>

* sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
          to indicate this is a ft migration.

* receiver: look for the signature: 
            QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
            which indicates the end of one migration.
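
For illustration, this is how the framing fits together; the constants and
calls below are taken from patches 3/4 of this series (surrounding code
elided):

    #define QEMU_VM_FILE_MAGIC_FT  0x51454654   /* starts every ft round      */
    #define QEMU_VM_EOF_MAGIC      0xFEEDCAFE   /* appended after QEMU_VM_EOF */

    /* sender, end of one round (qemu_savevm_state_complete): */
    qemu_put_byte(f, QEMU_VM_EOF);
    qemu_put_be32(f, QEMU_VM_EOF_MAGIC);

    /* sender, start of the next round (qemu_savevm_state_begin): */
    qemu_put_be32(f, QEMU_VM_FILE_MAGIC_FT);

    /* receiver: scans its prefetch buffer for the resulting 8-byte
     * big-endian pattern, i.e.
     * htobe64((uint64_t)QEMU_VM_EOF_MAGIC << 32 | QEMU_VM_FILE_MAGIC_FT),
     * which marks the boundary between two rounds. */
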
--
Jules Wang (4):
  Curling: add doc
  Curling: cmdline interface.
  Curling: the sender
  Curling: the receiver

 arch_init.c                   |  25 ++++--
 docs/curling.txt              |  51 ++++++++++++
 hmp-commands.hx               |  10 ++-
 hmp.c                         |   3 +-
 include/migration/migration.h |   1 +
 include/migration/qemu-file.h |   1 +
 include/sysemu/sysemu.h       |   5 +-
 migration.c                   |  50 ++++++++++--
 qapi-schema.json              |   6 +-
 qmp-commands.hx               |   3 +-
 savevm.c                      | 178 +++++++++++++++++++++++++++++++++++++++---
 11 files changed, 303 insertions(+), 30 deletions(-)
 create mode 100644 docs/curling.txt

-- 
1.8.0.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Qemu-devel] [PATCH v3 1/4] Curling: add doc
  2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
@ 2013-10-15  7:26 ` Jules Wang
  2013-10-17 11:25   ` Stefan Hajnoczi
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 2/4] Curling: cmdline interface Jules Wang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Jules Wang @ 2013-10-15  7:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, Jules Wang, owasserm, quintela

Curling provides a fault tolerance mechanism for KVM.
For more info, see 'docs/curling.txt'.

Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
 docs/curling.txt | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 docs/curling.txt

diff --git a/docs/curling.txt b/docs/curling.txt
new file mode 100644
index 0000000..f506a77
--- /dev/null
+++ b/docs/curling.txt
@@ -0,0 +1,51 @@
+KVM Fault Tolerance Specification
+=================================
+
+
+Contents:
+=========
+* Introduction
+* Usage
+* Design & Implementation
+* Performance
+
+Introduction
+============
+The goal of Curling (named after the sport) is to provide a fault tolerance
+(ft for short) mechanism for KVM, so that in the event of a hardware failure,
+the virtual machine fails over to the backup in a way that is completely
+transparent to the guest operating system.
+
+
+Usage
+=====
+The steps of Curling are the same as those of live migration, except for the
+following:
+1. Start ft in the QEMU monitor of the sender VM with the following commands:
+   > migrate_set_speed <full bandwidth>
+   > migrate -f tcp:<address>:<port>
+2. Connect to the receiver VM by VNC or SPICE. The screen of the VM is
+displayed when ft is ready.
+3. Now the sender VM is protected by ft. When it encounters a failure,
+failover kicks in.
+
+
+
+Design & Implementation
+=======================
+* By leveraging the live migration feature, we run endless live migrations
+between the sender and the receiver, so the two virtual machines stay in sync.
+
+* The receiver does not load the VM state as the migration proceeds. Instead,
+it prefetches one whole round of migration data into a buffer, then loads the
+VM state from that buffer afterwards. This "all or nothing" approach prevents
+the broken-in-the-middle problem Kemari has.
+
+* The sender sleeps for a short while after each migration round, to ease the
+performance penalty entailed by vm_stop and the iothread lock. This is a
+tradeoff between performance and accuracy.
+....
+
+
+Performance
+===========
-- 
1.8.0.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Qemu-devel] [PATCH v3 2/4] Curling: cmdline interface.
  2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 1/4] Curling: add doc Jules Wang
@ 2013-10-15  7:26 ` Jules Wang
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 3/4] Curling: the sender Jules Wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Jules Wang @ 2013-10-15  7:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, Jules Wang, owasserm, quintela

Add an option '-f' to the migration command line, indicating
whether to enable fault tolerance.

Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
 hmp-commands.hx               | 10 ++++++----
 hmp.c                         |  3 ++-
 include/migration/migration.h |  1 +
 migration.c                   |  3 ++-
 qapi-schema.json              |  6 +++++-
 qmp-commands.hx               |  3 ++-
 6 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index caae5ad..e6fa3f7 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -877,23 +877,25 @@ ETEXI
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
+        .args_type  = "detach:-d,blk:-b,inc:-i,fault-tolerant:-f,uri:s",
+        .params     = "[-d] [-b] [-i] [-f] uri",
         .help       = "migrate to URI (using -d to not wait for completion)"
 		      "\n\t\t\t -b for migration without shared storage with"
 		      " full copy of disk\n\t\t\t -i for migration without "
 		      "shared storage with incremental copy of disk "
-		      "(base image shared between src and destination)",
+		      "(base image shared between src and destination)"
+		      "\n\t\t\t -f for fault tolerant mode",
         .mhandler.cmd = hmp_migrate,
     },
 
 
 STEXI
-@item migrate [-d] [-b] [-i] @var{uri}
+@item migrate [-d] [-b] [-i] [-f] @var{uri}
 @findex migrate
 Migrate to @var{uri} (using -d to not wait for completion).
 	-b for migration with full copy of disk
 	-i for migration with incremental copy of disk (base image is shared)
+	-f for fault tolerant mode
 ETEXI
 
     {
diff --git a/hmp.c b/hmp.c
index 5891507..623a3f0 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1265,10 +1265,11 @@ void hmp_migrate(Monitor *mon, const QDict *qdict)
     int detach = qdict_get_try_bool(qdict, "detach", 0);
     int blk = qdict_get_try_bool(qdict, "blk", 0);
     int inc = qdict_get_try_bool(qdict, "inc", 0);
+    int ft = qdict_get_try_bool(qdict, "fault-tolerant", 0);
     const char *uri = qdict_get_str(qdict, "uri");
     Error *err = NULL;
 
-    qmp_migrate(uri, !!blk, blk, !!inc, inc, false, false, &err);
+    qmp_migrate(uri, !!blk, blk, !!inc, inc, false, false, !!ft, ft, &err);
     if (err) {
         monitor_printf(mon, "migrate: %s\n", error_get_pretty(err));
         error_free(err);
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 140e6b4..fc2b066 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -25,6 +25,7 @@
 
 struct MigrationParams {
     bool blk;
+    bool ft;
     bool shared;
 };
 
diff --git a/migration.c b/migration.c
index 2b1ab20..08dcca0 100644
--- a/migration.c
+++ b/migration.c
@@ -395,7 +395,7 @@ void migrate_del_blocker(Error *reason)
 
 void qmp_migrate(const char *uri, bool has_blk, bool blk,
                  bool has_inc, bool inc, bool has_detach, bool detach,
-                 Error **errp)
+                 bool has_ft, bool ft, Error **errp)
 {
     Error *local_err = NULL;
     MigrationState *s = migrate_get_current();
@@ -404,6 +404,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 
     params.blk = has_blk && blk;
     params.shared = has_inc && inc;
+    params.ft = has_ft && ft;
 
     if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
diff --git a/qapi-schema.json b/qapi-schema.json
index 60f3fd1..49dd5ff 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2594,12 +2594,16 @@
 # @detach: this argument exists only for compatibility reasons and
 #          is ignored by QEMU
 #
+# @fault-tolerant: #optional true to enable fault tolerance
+#                  (since 1.7)
+#
 # Returns: nothing on success
 #
 # Since: 0.14.0
 ##
 { 'command': 'migrate',
-  'data': {'uri': 'str', '*blk': 'bool', '*inc': 'bool', '*detach': 'bool' } }
+  'data': {'uri': 'str', '*blk': 'bool', '*inc': 'bool', '*detach': 'bool',
+           '*fault-tolerant': 'bool' } }
 
 # @xen-save-devices-state:
 #
diff --git a/qmp-commands.hx b/qmp-commands.hx
index fba15cd..ff13baf 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -611,7 +611,7 @@ EQMP
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
+        .args_type  = "detach:-d,blk:-b,inc:-i,fault-tolerant:-f,uri:s",
         .mhandler.cmd_new = qmp_marshal_input_migrate,
     },
 
@@ -625,6 +625,7 @@ Arguments:
 
 - "blk": block migration, full disk copy (json-bool, optional)
 - "inc": incremental disk copy (json-bool, optional)
+- "fault-tolerant": fault tolerant (json-bool, optional)
 - "uri": Destination URI (json-string)
 
 Example:
-- 
1.8.0.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Qemu-devel] [PATCH v3 3/4] Curling: the sender
  2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 1/4] Curling: add doc Jules Wang
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 2/4] Curling: cmdline interface Jules Wang
@ 2013-10-15  7:26 ` Jules Wang
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 4/4] Curling: the receiver Jules Wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Jules Wang @ 2013-10-15  7:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, Jules Wang, owasserm, quintela

By leveraging the live migration feature, the sender simply starts a
new migration when the previous one completes.

We need to handle the variables related to live migration very
carefully, so that the new migration does not restart from the very
beginning of the migration; instead, it continues the previous
migration.
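
Roughly, once a round's stop-and-copy phase has finished, the ft path does
the following (a condensed sketch of the diff below; locking, error handling
and accounting elided):

    qemu_savevm_state_complete(s->file, &s->params);
    if (!s->params.ft) {
        /* plain migration: mark MIG_STATE_COMPLETED and stop */
    } else {
        vm_start();                      /* resume the guest immediately     */
        if (time_spent < time_window) {  /* pace the rounds to ~100ms each   */
            g_usleep((time_window - time_spent) * 1000);
        }
        qemu_savevm_state_begin(s->file, &s->params);   /* start next round  */
    }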

Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
 arch_init.c             | 25 ++++++++++++++++++++-----
 include/sysemu/sysemu.h |  3 ++-
 migration.c             | 25 +++++++++++++++++++++++--
 savevm.c                | 20 ++++++++++++++++----
 4 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 7545d96..f71dfc4 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -107,6 +107,7 @@ const uint32_t arch_type = QEMU_ARCH;
 static bool mig_throttle_on;
 static int dirty_rate_high_cnt;
 static void check_guest_throttling(void);
+static MigrationParams ram_mig_params;
 
 /***********************************************************/
 /* ram save/restore */
@@ -595,6 +596,11 @@ static void ram_migration_cancel(void *opaque)
     migration_end();
 }
 
+static void ram_set_params(const MigrationParams *params, void *opaque)
+{
+    ram_mig_params.ft = params->ft;
+}
+
 static void reset_ram_globals(void)
 {
     last_seen_block = NULL;
@@ -610,10 +616,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 {
     RAMBlock *block;
     int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
+    bool create = false;
 
-    migration_bitmap = bitmap_new(ram_pages);
-    bitmap_set(migration_bitmap, 0, ram_pages);
-    migration_dirty_pages = ram_pages;
+    if (!ram_mig_params.ft || !migration_bitmap)  {
+        migration_bitmap = bitmap_new(ram_pages);
+        bitmap_set(migration_bitmap, 0, ram_pages);
+        migration_dirty_pages = ram_pages;
+        create = true;
+    }
     mig_throttle_on = false;
     dirty_rate_high_cnt = 0;
 
@@ -633,7 +643,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     qemu_mutex_lock_iothread();
     qemu_mutex_lock_ramlist();
     bytes_transferred = 0;
-    reset_ram_globals();
+    if (!ram_mig_params.ft || create) {
+        reset_ram_globals();
+    }
 
     memory_global_dirty_log_start();
     migration_bitmap_sync();
@@ -748,7 +760,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     }
 
     ram_control_after_iterate(f, RAM_CONTROL_FINISH);
-    migration_end();
+    if (!ram_mig_params.ft) {
+        migration_end();
+    }
 
     qemu_mutex_unlock_ramlist();
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
@@ -975,6 +989,7 @@ SaveVMHandlers savevm_ram_handlers = {
     .save_live_pending = ram_save_pending,
     .load_state = ram_load,
     .cancel = ram_migration_cancel,
+    .set_params = ram_set_params,
 };
 
 struct soundhw {
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index cd5791e..31d5e3f 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -82,7 +82,8 @@ bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_state_begin(QEMUFile *f,
                              const MigrationParams *params);
 int qemu_savevm_state_iterate(QEMUFile *f);
-void qemu_savevm_state_complete(QEMUFile *f);
+void qemu_savevm_state_complete(QEMUFile *f,
+                                const MigrationParams *params);
 void qemu_savevm_state_cancel(void);
 uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
 int qemu_loadvm_state(QEMUFile *f);
diff --git a/migration.c b/migration.c
index 08dcca0..28acd05 100644
--- a/migration.c
+++ b/migration.c
@@ -553,6 +553,7 @@ static void *migration_thread(void *opaque)
     int64_t max_size = 0;
     int64_t start_time = initial_time;
     bool old_vm_running = false;
+    int  time_window = 100;
 
     DPRINTF("beginning savevm\n");
     qemu_savevm_state_begin(s->file, &s->params);
@@ -564,6 +565,8 @@ static void *migration_thread(void *opaque)
 
     while (s->state == MIG_STATE_ACTIVE) {
         int64_t current_time;
+        int64_t time_spent;
+        int64_t migration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
         uint64_t pending_size;
 
         if (!qemu_file_rate_limit(s->file)) {
@@ -585,7 +588,7 @@ static void *migration_thread(void *opaque)
                 ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
                 if (ret >= 0) {
                     qemu_file_set_rate_limit(s->file, INT_MAX);
-                    qemu_savevm_state_complete(s->file);
+                    qemu_savevm_state_complete(s->file, &s->params);
                 }
                 qemu_mutex_unlock_iothread();
 
@@ -594,10 +597,28 @@ static void *migration_thread(void *opaque)
                     break;
                 }
 
-                if (!qemu_file_get_error(s->file)) {
+                if (!qemu_file_get_error(s->file) && !s->params.ft) {
                     migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
                     break;
                 }
+
+                if (s->params.ft) {
+                    if (old_vm_running) {
+                        qemu_mutex_lock_iothread();
+                        vm_start();
+                        qemu_mutex_unlock_iothread();
+
+                        current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+                        time_spent = current_time - migration_start_time;
+                        DPRINTF("this migration lasts for %" PRId64 "ms\n",
+                                time_spent);
+                        if (time_spent < time_window) {
+                            g_usleep((time_window - time_spent)*1000);
+                            initial_time += time_window - time_spent;
+                        }
+                    }
+                    qemu_savevm_state_begin(s->file, &s->params);
+                }
             }
         }
 
diff --git a/savevm.c b/savevm.c
index 2f631d4..e75d5d4 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1822,6 +1822,7 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
 }
 
 #define QEMU_VM_FILE_MAGIC           0x5145564d
+#define QEMU_VM_FILE_MAGIC_FT        0x51454654
 #define QEMU_VM_FILE_VERSION_COMPAT  0x00000002
 #define QEMU_VM_FILE_VERSION         0x00000003
 
@@ -1831,6 +1832,7 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
 #define QEMU_VM_SECTION_END          0x03
 #define QEMU_VM_SECTION_FULL         0x04
 #define QEMU_VM_SUBSECTION           0x05
+#define QEMU_VM_EOF_MAGIC            0xFEEDCAFE
 
 bool qemu_savevm_state_blocked(Error **errp)
 {
@@ -1858,7 +1860,12 @@ void qemu_savevm_state_begin(QEMUFile *f,
         se->ops->set_params(params, se->opaque);
     }
     
-    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+    if (params->ft) {
+        qemu_put_be32(f, QEMU_VM_FILE_MAGIC_FT);
+    } else {
+        qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+    }
+
     qemu_put_be32(f, QEMU_VM_FILE_VERSION);
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
@@ -1937,7 +1944,8 @@ int qemu_savevm_state_iterate(QEMUFile *f)
     return ret;
 }
 
-void qemu_savevm_state_complete(QEMUFile *f)
+void qemu_savevm_state_complete(QEMUFile *f,
+                                const MigrationParams *params)
 {
     SaveStateEntry *se;
     int ret;
@@ -1990,6 +1998,9 @@ void qemu_savevm_state_complete(QEMUFile *f)
     }
 
     qemu_put_byte(f, QEMU_VM_EOF);
+    if (params->ft) {
+        qemu_put_be32(f, QEMU_VM_EOF_MAGIC);
+    }
     qemu_fflush(f);
 }
 
@@ -2028,7 +2039,8 @@ static int qemu_savevm_state(QEMUFile *f)
     int ret;
     MigrationParams params = {
         .blk = 0,
-        .shared = 0
+        .shared = 0,
+        .ft = 0
     };
 
     if (qemu_savevm_state_blocked(NULL)) {
@@ -2047,7 +2059,7 @@ static int qemu_savevm_state(QEMUFile *f)
 
     ret = qemu_file_get_error(f);
     if (ret == 0) {
-        qemu_savevm_state_complete(f);
+        qemu_savevm_state_complete(f, &params);
         ret = qemu_file_get_error(f);
     }
     if (ret != 0) {
-- 
1.8.0.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Qemu-devel] [PATCH v3 4/4] Curling: the receiver
  2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
                   ` (2 preceding siblings ...)
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 3/4] Curling: the sender Jules Wang
@ 2013-10-15  7:26 ` Jules Wang
  2013-10-17 11:50 ` [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Stefan Hajnoczi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Jules Wang @ 2013-10-15  7:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, Jules Wang, owasserm, quintela

The receiver runs the migration loop until the migration connection is
lost. Then, it is started as the backup.

The receiver does not load the VM state as the migration proceeds.
Instead, it prefetches one whole round of migration data into a buffer,
then loads the VM state from that buffer afterwards.
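
Roughly, the incoming path becomes (a condensed sketch of the diff below):

    if (is_ft_migration(f)) {            /* peeks QEMU_VM_FILE_MAGIC_FT      */
        while (qemu_loadvm_state_ft(f) >= 0) {
            /* each call prefetches one complete round into the file's
             * buffer, scanning for the QEMU_VM_EOF_MAGIC +
             * QEMU_VM_FILE_MAGIC_FT signature, and only then loads the
             * VM state from that buffer */
        }
        /* the connection is gone: the last fully received round is the
         * checkpoint the backup resumes from */
    } else {
        ret = qemu_loadvm_state(f);      /* ordinary live migration          */
    }
    cpu_synchronize_all_post_init();
    qemu_announce_self();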

Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
 include/migration/qemu-file.h |   1 +
 include/sysemu/sysemu.h       |   2 +
 migration.c                   |  22 ++++--
 savevm.c                      | 158 ++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 173 insertions(+), 10 deletions(-)

diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index 0f757fb..f01ff10 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -92,6 +92,7 @@ typedef struct QEMUFileOps {
     QEMURamHookFunc *after_ram_iterate;
     QEMURamHookFunc *hook_ram_load;
     QEMURamSaveFunc *save_page;
+    QEMUFileGetBufferFunc *get_prefetch_buffer;
 } QEMUFileOps;
 
 QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 31d5e3f..e94193c 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -87,6 +87,8 @@ void qemu_savevm_state_complete(QEMUFile *f,
 void qemu_savevm_state_cancel(void);
 uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
 int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state_ft(QEMUFile *f);
+bool is_ft_migration(QEMUFile *f);
 
 /* SLIRP */
 void do_info_slirp(Monitor *mon);
diff --git a/migration.c b/migration.c
index 28acd05..e0734a7 100644
--- a/migration.c
+++ b/migration.c
@@ -19,6 +19,7 @@
 #include "monitor/monitor.h"
 #include "migration/qemu-file.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/cpus.h"
 #include "block/block.h"
 #include "qemu/sockets.h"
 #include "migration/block.h"
@@ -101,13 +102,24 @@ static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
     int ret;
+    int count = 0;
 
-    ret = qemu_loadvm_state(f);
-    qemu_fclose(f);
-    if (ret < 0) {
-        fprintf(stderr, "load of migration failed\n");
-        exit(EXIT_FAILURE);
+    if (is_ft_migration(f)) {
+        while (qemu_loadvm_state_ft(f) >= 0) {
+            count++;
+            DPRINTF("incoming count %d\r", count);
+        }
+        qemu_fclose(f);
+        DPRINTF("ft connection lost, launching self..\n");
+    } else {
+        ret = qemu_loadvm_state(f);
+        qemu_fclose(f);
+        if (ret < 0) {
+            fprintf(stderr, "load of migration failed\n");
+            exit(EXIT_FAILURE);
+        }
     }
+    cpu_synchronize_all_post_init();
     qemu_announce_self();
     DPRINTF("successfully loaded vm state\n");
 
diff --git a/savevm.c b/savevm.c
index e75d5d4..611fda2 100644
--- a/savevm.c
+++ b/savevm.c
@@ -52,6 +52,8 @@
 #define ARP_PTYPE_IP 0x0800
 #define ARP_OP_REQUEST_REV 0x3
 
+#define PREFETCH_BUFFER_SIZE 0x010000
+
 static int announce_self_create(uint8_t *buf,
 				uint8_t *mac_addr)
 {
@@ -135,6 +137,10 @@ struct QEMUFile {
     unsigned int iovcnt;
 
     int last_error;
+
+    uint8_t *prefetch_buf;
+    uint64_t prefetch_buf_index;
+    uint64_t prefetch_buf_size;
 };
 
 typedef struct QEMUFileStdio
@@ -193,6 +199,25 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
     return len;
 }
 
+static int socket_get_prefetch_buffer(void *opaque, uint8_t *buf,
+                                      int64_t pos, int size)
+{
+    QEMUFile *f = opaque;
+
+    if (f->prefetch_buf_size - pos <= 0) {
+        return 0;
+    }
+
+    if (f->prefetch_buf_size - pos < size) {
+        size = f->prefetch_buf_size - pos;
+    }
+
+    memcpy(buf, f->prefetch_buf + pos, size);
+
+    return size;
+}
+
+
 static int socket_close(void *opaque)
 {
     QEMUFileSocket *s = opaque;
@@ -440,6 +465,7 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
 static const QEMUFileOps socket_read_ops = {
     .get_fd =     socket_get_fd,
     .get_buffer = socket_get_buffer,
+    .get_prefetch_buffer = socket_get_prefetch_buffer,
     .close =      socket_close
 };
 
@@ -746,6 +772,8 @@ int qemu_fclose(QEMUFile *f)
     if (f->last_error) {
         ret = f->last_error;
     }
+
+    g_free(f->prefetch_buf);
     g_free(f);
     return ret;
 }
@@ -829,6 +857,14 @@ void qemu_put_byte(QEMUFile *f, int v)
 
 static void qemu_file_skip(QEMUFile *f, int size)
 {
+    if (f->prefetch_buf_index + size <= f->prefetch_buf_size) {
+        f->prefetch_buf_index += size;
+        return;
+    } else {
+        size -= f->prefetch_buf_size - f->prefetch_buf_index;
+        f->prefetch_buf_index = f->prefetch_buf_size;
+    }
+
     if (f->buf_index + size <= f->buf_size) {
         f->buf_index += size;
     }
@@ -838,6 +874,23 @@ static int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset)
 {
     int pending;
     int index;
+    int done;
+
+    if (f->ops->get_prefetch_buffer) {
+        if (f->prefetch_buf_index + offset < f->prefetch_buf_size) {
+            done = f->ops->get_prefetch_buffer(f,
+                                               buf,
+                                               f->prefetch_buf_index + offset,
+                                               size);
+            if (done == size) {
+                return size;
+            }
+            size -= done;
+            buf  += done;
+        } else {
+            offset -= f->prefetch_buf_size - f->prefetch_buf_index;
+        }
+    }
 
     assert(!qemu_file_is_writable(f));
 
@@ -882,7 +935,15 @@ int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size)
 
 static int qemu_peek_byte(QEMUFile *f, int offset)
 {
-    int index = f->buf_index + offset;
+    int index;
+
+    if (f->prefetch_buf_index + offset < f->prefetch_buf_size) {
+        return f->prefetch_buf[f->prefetch_buf_index + offset];
+    } else {
+        offset -= f->prefetch_buf_size - f->prefetch_buf_index;
+    }
+
+    index = f->buf_index + offset;
 
     assert(!qemu_file_is_writable(f));
 
@@ -896,6 +957,16 @@ static int qemu_peek_byte(QEMUFile *f, int offset)
     return f->buf[index];
 }
 
+static unsigned int qemu_peek_be32(QEMUFile *f, int offset)
+{
+    unsigned int v;
+    v = qemu_peek_byte(f, offset) << 24;
+    v |= qemu_peek_byte(f, offset + 1) << 16;
+    v |= qemu_peek_byte(f, offset + 2) << 8;
+    v |= qemu_peek_byte(f, offset + 3);
+    return v;
+}
+
 int qemu_get_byte(QEMUFile *f)
 {
     int result;
@@ -983,7 +1054,6 @@ uint64_t qemu_get_be64(QEMUFile *f)
     return v;
 }
 
-
 /* timer */
 
 void timer_put(QEMUFile *f, QEMUTimer *ts)
@@ -2200,6 +2270,11 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
     }
 }
 
+bool is_ft_migration(QEMUFile *f)
+{
+    return (qemu_peek_be32(f, 0) == QEMU_VM_FILE_MAGIC_FT);
+}
+
 typedef struct LoadStateEntry {
     QLIST_ENTRY(LoadStateEntry) entry;
     SaveStateEntry *se;
@@ -2221,8 +2296,9 @@ int qemu_loadvm_state(QEMUFile *f)
     }
 
     v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
+    if (v != QEMU_VM_FILE_MAGIC && v != QEMU_VM_FILE_MAGIC_FT) {
         return -EINVAL;
+    }
 
     v = qemu_get_be32(f);
     if (v == QEMU_VM_FILE_VERSION_COMPAT) {
@@ -2309,8 +2385,6 @@ int qemu_loadvm_state(QEMUFile *f)
         }
     }
 
-    cpu_synchronize_all_post_init();
-
     ret = 0;
 
 out:
@@ -2326,6 +2400,79 @@ out:
     return ret;
 }
 
+int qemu_loadvm_state_ft(QEMUFile *f)
+{
+    int ret = 0;
+    int i   = 0;
+    int done = 0;
+    uint64_t size = 0;
+    uint64_t offset = 0;
+    uint8_t *prefetch_buf = NULL;
+    uint8_t *buf = NULL;
+
+    uint64_t max_mem = last_ram_offset() * 1.5;
+    uint64_t eof = htobe64((uint64_t)QEMU_VM_EOF_MAGIC << 32 |
+                                  QEMU_VM_FILE_MAGIC_FT);
+
+    if (!f->ops->get_prefetch_buffer) {
+        fprintf(stderr, "Fault tolerant is not supported by this protocol.\n");
+        return -EINVAL;
+    }
+
+    size = PREFETCH_BUFFER_SIZE;
+    prefetch_buf = g_malloc(size);
+
+    while (true) {
+        if (offset + TARGET_PAGE_SIZE >= size) {
+            if (size*2 > max_mem) {
+                fprintf(stderr, "qemu_loadvm_state_ft: warning:" \
+                       "Prefetch buffer becomes too large.\n" \
+                       "Fault tolerant is unstable when you see this,\n" \
+                       "please increase the bandwidth or increase " \
+                       "the max down time.\n");
+                break;
+            }
+            size = size * 2;
+            buf = g_try_realloc(prefetch_buf, size);
+            if (!buf) {
+                error_report("qemu_loadvm_state_ft: out of memory.\n");
+                g_free(prefetch_buf);
+                return -ENOMEM;
+            }
+
+            prefetch_buf = buf;
+        }
+
+        done = qemu_get_buffer(f, prefetch_buf + offset, TARGET_PAGE_SIZE);
+
+        ret = qemu_file_get_error(f);
+        if (ret != 0) {
+            g_free(prefetch_buf);
+            return ret;
+        }
+
+        buf = prefetch_buf + offset;
+        offset += done;
+        for (i = -7; i < done; i++) {
+            if (memcmp(buf + i, &eof, 8) == 0) {
+                goto out;
+            }
+        }
+    }
+ out:
+    g_free(f->prefetch_buf);
+    f->prefetch_buf_size = offset;
+    f->prefetch_buf_index = 0;
+    f->prefetch_buf = prefetch_buf;
+
+    ret = qemu_loadvm_state(f);
+
+    /* Skip magic number */
+    qemu_get_be32(f);
+
+    return ret;
+}
+
 static BlockDriverState *find_vmstate_bs(void)
 {
     BlockDriverState *bs = NULL;
@@ -2437,6 +2584,7 @@ void do_savevm(Monitor *mon, const QDict *qdict)
         goto the_end;
     }
     ret = qemu_savevm_state(f);
+    cpu_synchronize_all_post_init();
     vm_state_size = qemu_ftell(f);
     qemu_fclose(f);
     if (ret < 0) {
-- 
1.8.0.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/4] Curling: add doc
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 1/4] Curling: add doc Jules Wang
@ 2013-10-17 11:25   ` Stefan Hajnoczi
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Hajnoczi @ 2013-10-17 11:25 UTC (permalink / raw)
  To: Jules Wang; +Cc: pbonzini, quintela, qemu-devel, owasserm

On Tue, Oct 15, 2013 at 03:26:20PM +0800, Jules Wang wrote:
> +Usage
> +=====
> +The steps of curling are the same as the steps of live migration except the
> +following:
> +1. Start ft in the qemu monitor of sender vm by following cmdline:
> +   > migrate_set_speed <full bandwidth>
> +   > migrate -f tcp:<address>:<port>
> +2. Connect to the receiver vm by vnc or spice. The screen of the vm is displayed
> +when ft is ready.

Management tools (like libvirt) need a QMP event that reports when FT is
active.  This allows users to check the FT status of a guest and
understand when the guest is protected.

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
  2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
                   ` (3 preceding siblings ...)
  2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 4/4] Curling: the receiver Jules Wang
@ 2013-10-17 11:50 ` Stefan Hajnoczi
  2013-10-23  0:08   ` Jules
  2013-10-22 21:00 ` Michael R. Hines
  2013-10-22 21:08 ` Michael R. Hines
  6 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2013-10-17 11:50 UTC (permalink / raw)
  To: Jules Wang; +Cc: pbonzini, quintela, qemu-devel, owasserm

On Tue, Oct 15, 2013 at 03:26:19PM +0800, Jules Wang wrote:
> v2 -> v3:
> * add documentation of new option in qapi-schema.
> 
> * long option name: ft -> fault-tolerant
> 
> v1 -> v2:
> * cmdline: migrate curling:tcp:<address>:<port> 
>        ->  migrate -f tcp:<address>:<port>
> 
> * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
>           to indicate this is a ft migration.
> 
> * receiver: look for the signature: 
>             QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
>             which indicates the end of one migration.
> --
> Jules Wang (4):
>   Curling: add doc
>   Curling: cmdline interface.
>   Curling: the sender
>   Curling: the receiver

It would be helpful to clarify the status of Curling in the cover letter
email so reviewers know what to expect.

This series does not address I/O or failover.  I guess you are aware of
the missing topics that I mentioned, here are my thoughts on them:

I/O needs to be held back until the destination host has acknowledged
receiving the last full migration state.  The outside world cannot
witness state changes in the guest until the migration state has been
successfully transferred to the destination host.  Otherwise the guest
may appear to act incorrectly when resuming execution from the last
snapshot.

The time period used by the FT sender thread determines how much latency
is added to I/O requests.
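
To illustrate the ordering requirement (this is purely conceptual, not code
from this series or from QEMU, and the function names are made up):

    /* epoch N: the guest runs, its outbound I/O is queued, not emitted */
    queue_outbound_io(epoch_buf, request);

    /* checkpoint N is streamed to the destination; only once the
     * destination acknowledges it ... */
    if (destination_acked_checkpoint(n)) {
        release_outbound_io(epoch_buf);  /* ... may the outside world see it */
    }

    /* on failover, any unacknowledged epoch is discarded, so no I/O that
     * the backup cannot reproduce has ever left the host */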

Failover functionality is missing from these patches.  We cannot simply
start executing on the destination host when the migration connection
ends.  If the guest disk image is located on shared storage then
split-brain occurs when a network error terminates the migration
connection - will both hosts begin accessing the shared disk?

What is your plan to address these issues?

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
  2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
                   ` (4 preceding siblings ...)
  2013-10-17 11:50 ` [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Stefan Hajnoczi
@ 2013-10-22 21:00 ` Michael R. Hines
  2013-10-23  5:23   ` Jules
  2013-10-22 21:08 ` Michael R. Hines
  6 siblings, 1 reply; 13+ messages in thread
From: Michael R. Hines @ 2013-10-22 21:00 UTC (permalink / raw)
  To: Jules Wang
  Cc: pbonzini, Juan Jose Quintela Carreira, Michael R. Hines,
	qemu-devel, owasserm


On 10/15/2013 03:26 AM, Jules Wang wrote:
> v2 -> v3:
> * add documentation of new option in qapi-schema.
>
> * long option name: ft -> fault-tolerant
>
> v1 -> v2:
> * cmdline: migrate curling:tcp:<address>:<port>
>         ->  migrate -f tcp:<address>:<port>
>
> * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
>            to indicate this is a ft migration.
>
> * receiver: look for the signature:
>              QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
>              which indicates the end of one migration.
> --
> Jules Wang (4):
>    Curling: add doc
>    Curling: cmdline interface.
>    Curling: the sender
>    Curling: the receiver
>
>   arch_init.c                   |  25 ++++--
>   docs/curling.txt              |  51 ++++++++++++
>   hmp-commands.hx               |  10 ++-
>   hmp.c                         |   3 +-
>   include/migration/migration.h |   1 +
>   include/migration/qemu-file.h |   1 +
>   include/sysemu/sysemu.h       |   5 +-
>   migration.c                   |  50 ++++++++++--
>   qapi-schema.json              |   6 +-
>   qmp-commands.hx               |   3 +-
>   savevm.c                      | 178 +++++++++++++++++++++++++++++++++++++++---
>   11 files changed, 303 insertions(+), 30 deletions(-)
>   create mode 100644 docs/curling.txt
>

Jules, I think we should work together. The patches I sent this week
solve all of the problems (and more) of Kemari and have been in
testing for over 1 year.

1. I/O buffering is already working
2. Checkpoint parallelism is already working
3. Staging of the checkpoint memory is already working
     on both the sender side and receiver side.
4. Checkpoint chunking is already working (this means that checkpoints
     can be very large and must be split up like slab caches,
     which can dynamically grow and shrink as the amount of
     dirty memory in the virtual machine fluctuates).
5. RDMA checkpointing is already working
6. TCP checkpointing is already working
7. There does not need to be a custom migration URI
      - this is easily implemented through a capability.
8. Libvirt support is already available on github.
9. There is no need to modify the QEMU migration metadata state information.

All of these features take advantage of the recent advances
in QEMU in migration performance improvements over the last
few years.

Would you be interested in "joining forces"? You even picked
a cool name (I didn't even choose a name)..... =)

Also: I will soon be working in IBM China Beijing,  for 3 years - starting
next month - perhaps we could talk on the phone (or meet in person)?

- Michael

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
  2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
                   ` (5 preceding siblings ...)
  2013-10-22 21:00 ` Michael R. Hines
@ 2013-10-22 21:08 ` Michael R. Hines
  6 siblings, 0 replies; 13+ messages in thread
From: Michael R. Hines @ 2013-10-22 21:08 UTC (permalink / raw)
  To: Jules Wang; +Cc: pbonzini, quintela, qemu-devel, owasserm

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

On 10/15/2013 03:26 AM, Jules Wang wrote:
> v2 -> v3:
> * add documentation of new option in qapi-schema.
>
> * long option name: ft -> fault-tolerant
>
> v1 -> v2:
> * cmdline: migrate curling:tcp:<address>:<port>
>         ->  migrate -f tcp:<address>:<port>
>
> * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
>            to indicate this is a ft migration.
>
> * receiver: look for the signature:
>              QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
>              which indicates the end of one migration.
> --
> Jules Wang (4):
>    Curling: add doc
>    Curling: cmdline interface.
>    Curling: the sender
>    Curling: the receiver
>
>   arch_init.c                   |  25 ++++--
>   docs/curling.txt              |  51 ++++++++++++
>   hmp-commands.hx               |  10 ++-
>   hmp.c                         |   3 +-
>   include/migration/migration.h |   1 +
>   include/migration/qemu-file.h |   1 +
>   include/sysemu/sysemu.h       |   5 +-
>   migration.c                   |  50 ++++++++++--
>   qapi-schema.json              |   6 +-
>   qmp-commands.hx               |   3 +-
>   savevm.c                      | 178 +++++++++++++++++++++++++++++++++++++++---
>   11 files changed, 303 insertions(+), 30 deletions(-)
>   create mode 100644 docs/curling.txt
>

Ooops, forgot to send you the wiki link:

http://wiki.qemu.org/Features/MicroCheckpointing

[-- Attachment #2: Type: text/html, Size: 1914 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
  2013-10-17 11:50 ` [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Stefan Hajnoczi
@ 2013-10-23  0:08   ` Jules
  2013-10-24 12:10     ` Stefan Hajnoczi
  0 siblings, 1 reply; 13+ messages in thread
From: Jules @ 2013-10-23  0:08 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: pbonzini, quintela, qemu-devel, owasserm


> On Tue, Oct 15, 2013 at 03:26:19PM +0800, Jules Wang wrote:
> > v2 -> v3:
> > * add documentation of new option in qapi-schema.
> > 
> > * long option name: ft -> fault-tolerant
> > 
> > v1 -> v2:
> > * cmdline: migrate curling:tcp:<address>:<port> 
> >        ->  migrate -f tcp:<address>:<port>
> > 
> > * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
> >           to indicate this is a ft migration.
> > 
> > * receiver: look for the signature: 
> >             QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
> >             which indicates the end of one migration.
> > --
> > Jules Wang (4):
> >   Curling: add doc
> >   Curling: cmdline interface.
> >   Curling: the sender
> >   Curling: the receiver
> 

First of all, thanks for your superb and spot-on comments.

> It would be helpful to clarify the status of Curling in the cover letter
> email so reviewers know what to expect.

OK, but I'm not quite clear about how to describe the status; would you
please give me an example?
> 
> This series does not address I/O or failover.  I guess you are aware of
> the missing topics that I mentioned, here are my thoughts on them:
> 
> I/O needs to be held back until the destination host has acknowledged
> receiving the last full migration state.  The outside world cannot
> witness state changes in the guest until the migration state has been
> successfully transferred to the destination host.  Otherwise the guest
> may appear to act incorrectly when resuming execution from the last
> snapshot.
> 
> The time period used by the FT sender thread determines how much latency
> is added to I/O requests.

Yes, there is the latency. That is inevitable.

I guess you mean the following situation:
If a msg 'hello' is sent to the chat room server just a few seconds
before the failover happens, there is a possibility that the msg will be
sent to the others twice or be lost.

Am I right?

> 
> Failover functionality is missing from these patches.  We cannot simply
> start executing on the destination host when the migration connection
> ends.  If the guest disk image is located on shared storage then
> split-brain occurs when a network error terminates the migration
> connection - 

> will both hosts begin accessing the shared disk? 
YES
> 

I have a simple way to handle that: introduce a third point of
reference, the gateway.

Both the sender and the receiver check the connectivity to the gateway
every X seconds. Let's use A and B stand for whether the sender and the
receiver are connected to the gateway respectively.

When the connection between the sender and the receiver is down.
A && B is false.

If A is false, the vm instance at the sender will be stopped.
If B is false, the vm instance at the receiver will not be started.

a.A false  B false: 0 vm run
b.A false  B true: 1 vm run 
c.A true   B false: 1 vm run
d.A true   B true : 1 vm run (normal case)

It becomes complicated when we consider the state transitions in
these four states.
  
I suggest adding this feature to libvirt instead of qemu.
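
For illustration only, the decision logic could look something like this
(a hypothetical sketch for a management layer such as libvirt; none of
these functions exist in this series):

    /* checked every X seconds on each host */
    bool a = sender_can_reach_gateway();
    bool b = receiver_can_reach_gateway();

    if (!a) {
        stop_sender_vm();            /* sender lost quorum: stop the guest   */
    }
    if (!b) {
        keep_receiver_paused();      /* receiver lost quorum: do not start   */
    }
    /* so when the migration link itself breaks, at most the one host that
     * still reaches the gateway ends up running the guest */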


> What is your plan to address these issues?
> 
> Stefan
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
  2013-10-22 21:00 ` Michael R. Hines
@ 2013-10-23  5:23   ` Jules
  2013-11-06 18:38     ` Michael R. Hines
  0 siblings, 1 reply; 13+ messages in thread
From: Jules @ 2013-10-23  5:23 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: pbonzini, Juan Jose Quintela Carreira, Michael R. Hines,
	qemu-devel, owasserm



On 2013-10-22 17:00 -0400,Michael R. Hines wrote:
> On 10/15/2013 03:26 AM, Jules Wang wrote:
> > v2 -> v3:
> > * add documentation of new option in qapi-schema.
> >
> > * long option name: ft -> fault-tolerant
> >
> > v1 -> v2:
> > * cmdline: migrate curling:tcp:<address>:<port>
> >         ->  migrate -f tcp:<address>:<port>
> >
> > * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
> >            to indicate this is a ft migration.
> >
> > * receiver: look for the signature:
> >              QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
> >              which indicates the end of one migration.
> > --
> > Jules Wang (4):
> >    Curling: add doc
> >    Curling: cmdline interface.
> >    Curling: the sender
> >    Curling: the receiver
> >
> >   arch_init.c                   |  25 ++++--
> >   docs/curling.txt              |  51 ++++++++++++
> >   hmp-commands.hx               |  10 ++-
> >   hmp.c                         |   3 +-
> >   include/migration/migration.h |   1 +
> >   include/migration/qemu-file.h |   1 +
> >   include/sysemu/sysemu.h       |   5 +-
> >   migration.c                   |  50 ++++++++++--
> >   qapi-schema.json              |   6 +-
> >   qmp-commands.hx               |   3 +-
> >   savevm.c                      | 178 +++++++++++++++++++++++++++++++++++++++---
> >   11 files changed, 303 insertions(+), 30 deletions(-)
> >   create mode 100644 docs/curling.txt
> >
> 
> Jules, I think we should work together. The patches I sent this week
> solve all of the problems (and more) of Kemari and have been in
> testing for over 1 year.
> 
> 1. I/O buffering is already working
> 2. Checkpoint parallelism is already working
> 3. Staging of the checkpoint memory is already working
>      on both the sender side and receiver side.
> 4. Checkpoint chunking is already working (this means that checkpoints
>      can be very large and must be split up like slab caches,
>      which can dynamically grow and shrink as the amount of
>      dirty memory in the virtual machine fluctuates).
> 5. RDMA checkpointing is already working
> 6. TCP checkpointing is already working
> 7. There does not need to be a custom migration URI
>       - this is easily implemented through a capability.
> 8. Libvirt support is already available on github.
> 9. There is no need to modify the QEMU migration metadata state information.
> 
> All of these features take advantage of the recent advances
> in QEMU in migration performance improvements over the last
> few years.

I will read your patches carefully as good learning material.
> 
> Would you be interested in "joining forces"? You even picked
> a cool name (I didn't even choose a name)..... =)
Yes, your solution is better than mine obviously, and we could work
together to improve your patches. 
> 
> Also: I will soon be working in IBM China Beijing,  for 3 years - starting
> next month - perhaps we could talk on the phone (or meet in person)?
Welcome to Beijing and take some dust masks with you, you will need
them. :)
I prefer email or meet in person if necessary. I will read and try your
patches first.

> - Michael
> 
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
  2013-10-23  0:08   ` Jules
@ 2013-10-24 12:10     ` Stefan Hajnoczi
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Hajnoczi @ 2013-10-24 12:10 UTC (permalink / raw)
  To: Jules; +Cc: Paolo Bonzini, Juan Quintela, qemu-devel, Orit Wasserman

On Wed, Oct 23, 2013 at 1:08 AM, Jules <junqing.wang@cs2c.com.cn> wrote:
>
>> On Tue, Oct 15, 2013 at 03:26:19PM +0800, Jules Wang wrote:
>> > v2 -> v3:
>> > * add documentation of new option in qapi-schema.
>> >
>> > * long option name: ft -> fault-tolerant
>> >
>> > v1 -> v2:
>> > * cmdline: migrate curling:tcp:<address>:<port>
>> >        ->  migrate -f tcp:<address>:<port>
>> >
>> > * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
>> >           to indicate this is a ft migration.
>> >
>> > * receiver: look for the signature:
>> >             QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
>> >             which indicates the end of one migration.
>> > --
>> > Jules Wang (4):
>> >   Curling: add doc
>> >   Curling: cmdline interface.
>> >   Curling: the sender
>> >   Curling: the receiver
>>
>
> First of all, thanks for your superb and spot-on comments.
>
>> It would be helpful to clarify the status of Curling in the cover letter
>> email so reviewers know what to expect.
>
> OK, but I'm not quite clear about how to clarify the status, would you
> pls give me an example?

That status would be an explanation of what is currently included in the
patches, which functionality already works, and what you still plan to
implement before the series can be merged.

>> This series does not address I/O or failover.  I guess you are aware of
>> the missing topics that I mentioned, here are my thoughts on them:
>>
>> I/O needs to be held back until the destination host has acknowledged
>> receiving the last full migration state.  The outside world cannot
>> witness state changes in the guest until the migration state has been
>> successfully transferred to the destination host.  Otherwise the guest
>> may appear to act incorrectly when resuming execution from the last
>> snapshot.
>>
>> The time period used by the FT sender thread determines how much latency
>> is added to I/O requests.
>
> Yes, there is the latency. That is inevitable.
>
> I guess you mean the following situation:
> If a msg 'hello' is sent to the chat room server just a few seconds
> before the failover happens, there is a possibility that the msg will be
> sent to the others twice or be lost.
>
> Am I right?

Yes, and this is a fundamental requirement for FT.

I/O is not idempotent.  This means it is not possible to repeat the
same operation twice and get the same result.

Other fault tolerance solutions include a mechanism to hold back I/O
until the checkpoint has been committed by the other host.  This way
no I/O is repeated and applications will not break during failover.

For example, imagine a "compare and swap" operation.  If the VM sends
out a "compare and swap" command to a remote server and fails, then
your current patches may send the command again on the other host.
The problem is that the command will not succeed the second time and
therefore the application fails with an error.

>>
>> Failover functionality is missing from these patches.  We cannot simply
>> start executing on the destination host when the migration connection
>> ends.  If the guest disk image is located on shared storage then
>> split-brain occurs when a network error terminates the migration
>> connection -
>
>> will both hosts begin accessing the shared disk?
> YES
>>
>
> I have a simple way to handle that: introduce a third point of
> reference, the gateway.
>
> Both the sender and the receiver check the connectivity to the gateway
> every X seconds. Let's use A and B stand for whether the sender and the
> receiver are connected to the gateway respectively.
>
> When the connection between the sender and the receiver is down.
> A && B is false.
>
> If A is false, the vm instance at the sender will be stopped.
> If B is false, the vm instance at the receiver will not be started.
>
> a.A false  B false: 0 vm run
> b.A false  B true: 1 vm run
> c.A true   B false: 1 vm run
> d.A true   B true : 1 vm run (normal case)
>
> It becomes complicated when we consider the state transitions in
> these four states.
>
> I suggest adding this feature to libvirt instead of qemu.

I agree that the details of the failover (aka quorum and fencing)
should be implemented as policies outside QEMU, if possible.

Also, there were two presentations about fault tolerance at KVM Forum
2013 a few days ago:
https://docs.google.com/file/d/0BzyAwvVlQckebVBrNXdlaTdWVUk/edit
https://docs.google.com/file/d/0BzyAwvVlQckeczNUZHRod28yVXc/edit

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance
  2013-10-23  5:23   ` Jules
@ 2013-11-06 18:38     ` Michael R. Hines
  0 siblings, 0 replies; 13+ messages in thread
From: Michael R. Hines @ 2013-11-06 18:38 UTC (permalink / raw)
  To: Jules
  Cc: qemu-devel, pbonzini, owasserm, Michael R. Hines,
	Juan Jose Quintela Carreira

On 10/23/2013 01:23 AM, Jules wrote:
>
> On 2013-10-22 17:00 -0400,Michael R. Hines wrote:
>> On 10/15/2013 03:26 AM, Jules Wang wrote:
>>> v2 -> v3:
>>> * add documentation of new option in qapi-schema.
>>>
>>> * long option name: ft -> fault-tolerant
>>>
>>> v1 -> v2:
>>> * cmdline: migrate curling:tcp:<address>:<port>
>>>          ->  migrate -f tcp:<address>:<port>
>>>
>>> * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
>>>             to indicate this is a ft migration.
>>>
>>> * receiver: look for the signature:
>>>               QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
>>>               which indicates the end of one migration.
>>> --
>>> Jules Wang (4):
>>>     Curling: add doc
>>>     Curling: cmdline interface.
>>>     Curling: the sender
>>>     Curling: the receiver
>>>
>>>    arch_init.c                   |  25 ++++--
>>>    docs/curling.txt              |  51 ++++++++++++
>>>    hmp-commands.hx               |  10 ++-
>>>    hmp.c                         |   3 +-
>>>    include/migration/migration.h |   1 +
>>>    include/migration/qemu-file.h |   1 +
>>>    include/sysemu/sysemu.h       |   5 +-
>>>    migration.c                   |  50 ++++++++++--
>>>    qapi-schema.json              |   6 +-
>>>    qmp-commands.hx               |   3 +-
>>>    savevm.c                      | 178 +++++++++++++++++++++++++++++++++++++++---
>>>    11 files changed, 303 insertions(+), 30 deletions(-)
>>>    create mode 100644 docs/curling.txt
>>>
>> Jules, I think we should work together. The patches I sent this week
>> solve all of the problems (and more) of Kemari and have been in
>> testing for over 1 year.
>>
>> 1. I/O buffering is already working
>> 2. Checkpoint parallelism is already working
>> 3. Staging of the checkpoint memory is already working
>>       on both the sender side and receiver side.
>> 4. Checkpoint chunking is already working (this means that checkpoints
>>       can be very large and must be split up like slab caches,
>>       which can dynamically grow and shrink as the amount of
>>       dirty memory in the virtual machine fluctuates).
>> 5. RDMA checkpointing is already working
>> 6. TCP checkpointing is already working
>> 7. There does not need to be a custom migration URI
>>        - this is easily implemented through a capability.
>> 8. Libvirt support is already available on github.
>> 9. There is no need to modify the QEMU migration metadata state information.
>>
>> All of these features take advantage of the recent advances
>> in QEMU in migration performance improvements over the last
>> few years.
> I will read your patches carefully as good learning material.

Cool - I'm back from travelling. Sorry for the delayed response.

I look forward to a review of the code from you - I'm excited to get
some kind of review going.


>> Would you be interested in "joining forces"? You even picked
>> a cool name (I didn't even choose a name)..... =)
> Yes, your solution is better than mine obviously, and we could work
> together to improve your patches.
>> Also: I will soon be working in IBM China Beijing,  for 3 years - starting
>> next month - perhaps we could talk on the phone (or meet in person)?
> Welcome to Beijing and take some dust masks with you, you will need
> them. :)
> I prefer email or meet in person if necessary. I will read and try your
> patches first.

Thank you - I will reach out to you once I've arrived.

- Michael

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread

Thread overview: 13+ messages
2013-10-15  7:26 [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Jules Wang
2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 1/4] Curling: add doc Jules Wang
2013-10-17 11:25   ` Stefan Hajnoczi
2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 2/4] Curling: cmdline interface Jules Wang
2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 3/4] Curling: the sender Jules Wang
2013-10-15  7:26 ` [Qemu-devel] [PATCH v3 4/4] Curling: the receiver Jules Wang
2013-10-17 11:50 ` [Qemu-devel] [PATCH v3 0/4] Curling: KVM Fault Tolerance Stefan Hajnoczi
2013-10-23  0:08   ` Jules
2013-10-24 12:10     ` Stefan Hajnoczi
2013-10-22 21:00 ` Michael R. Hines
2013-10-23  5:23   ` Jules
2013-11-06 18:38     ` Michael R. Hines
2013-10-22 21:08 ` Michael R. Hines
