xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/6] migration/remus: bug fix and cleanup
@ 2016-01-19  7:17 Wen Congyang
  2016-01-19  7:17 ` [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state Wen Congyang
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Wen Congyang @ 2016-01-19  7:17 UTC (permalink / raw)
  To: xen devel, Andrew Cooper
  Cc: Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang, Ian Jackson,
	Shriram Rajagopalan, Yang Hongyang

Wen Congyang (6):
  remus: don't do failover if we don't have an consistent state
  remus: don't call stream_continue() when doing failover
  remus: resume immediately if libxl__xc_domain_save_done() completes
  tools/libxc: don't send end record if remus fails
  tools/libxc: error handling for the postcopy() callback
  tools/libxl: remove unused function libxl__domain_save_device_model()

 tools/libxc/xc_sr_restore.c      |  6 ++-
 tools/libxc/xc_sr_save.c         |  6 ++-
 tools/libxl/libxl.c              |  4 --
 tools/libxl/libxl.h              |  4 ++
 tools/libxl/libxl_dom.c          | 91 ----------------------------------------
 tools/libxl/libxl_internal.h     |  6 ---
 tools/libxl/libxl_stream_read.c  | 36 +++++++++++++---
 tools/libxl/libxl_stream_write.c | 14 ++++++-
 8 files changed, 56 insertions(+), 111 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state
  2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
@ 2016-01-19  7:17 ` Wen Congyang
  2016-01-19 10:48   ` Ian Campbell
  2016-01-19  7:17 ` [PATCH v5 2/6] remus: don't call stream_continue() when doing failover Wen Congyang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Wen Congyang @ 2016-01-19  7:17 UTC (permalink / raw)
  To: xen devel, Andrew Cooper
  Cc: Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang, Ian Jackson,
	Shriram Rajagopalan, Yang Hongyang

We will have an consistent state when a CHECKPOINT_END record
is received. After the first CHECKPOINT_END record is received,
we will buffer all records until the next CHECKPOINT_END record
is received. So if the checkpoint() callback returns XGR_CHECKPOINT_FAILOVER,
we only can do failover if ctx->restore.buffer_all_records is
true.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_sr_restore.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 05159bb..9fe2829 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -493,7 +493,11 @@ static int handle_checkpoint(struct xc_sr_context *ctx)
         break;
 
     case XGR_CHECKPOINT_FAILOVER:
-        rc = BROKEN_CHANNEL;
+        if ( ctx->restore.buffer_all_records )
+            rc = BROKEN_CHANNEL;
+        else
+            /* We don't have an consistent state */
+            rc = -1;
         goto err;
 
     default: /* Other fatal error */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 2/6] remus: don't call stream_continue() when doing failover
  2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
  2016-01-19  7:17 ` [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state Wen Congyang
@ 2016-01-19  7:17 ` Wen Congyang
  2016-01-19 10:50   ` Ian Campbell
  2016-01-19  7:17 ` [PATCH v5 3/6] remus: resume immediately if libxl__xc_domain_save_done() completes Wen Congyang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Wen Congyang @ 2016-01-19  7:17 UTC (permalink / raw)
  To: xen devel, Andrew Cooper
  Cc: Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang, Ian Jackson,
	Shriram Rajagopalan, Yang Hongyang

stream_continue() is used for migration to read emulator
xenstore data and emulator context. For remus, if we do
failover, we have read it in the checkpoint cycle, and
we only need to complete the stream.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxl/libxl_stream_read.c | 36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/tools/libxl/libxl_stream_read.c b/tools/libxl/libxl_stream_read.c
index 258dec4..dac134e 100644
--- a/tools/libxl/libxl_stream_read.c
+++ b/tools/libxl/libxl_stream_read.c
@@ -104,6 +104,20 @@
  * Depending on the contents of the stream, there are likely to be several
  * parallel tasks being managed.  check_all_finished() is used to join all
  * tasks in both success and error cases.
+ *
+ * Failover for remus
+ *  - We buffer all records until a CHECKPOINT_END record is received
+ *  - We will consume the buffered records when a CHECKPOINT_END record
+ *    is received
+ *  - If we find some internal error, then rc or retval is not 0 in
+ *    libxl__xc_domain_restore_done(). In this case, we don't resume the
+ *    guest
+ *  - If we need to do failover from primary, then rc and retval are both
+ *    0 in libxl__xc_domain_restore_done(). In this case, the buffered
+ *    state will be dropped, because we haven't received a CHECKPOINT_END
+ *    record, and therefore the buffered state is inconsistent. In
+ *    libxl__xc_domain_restore_done(), we just complete the stream and
+ *    stream->completion_callback() will be called to resume the guest
  */
 
 /* Success/error/cleanup handling. */
@@ -758,6 +772,9 @@ void libxl__xc_domain_restore_done(libxl__egc *egc, void *dcs_void,
     libxl__stream_read_state *stream = &dcs->srs;
     STATE_AO_GC(dcs->ao);
 
+    /* convenience aliases */
+    const int checkpointed_stream = dcs->restore_params.checkpointed_stream;
+
     if (rc)
         goto err;
 
@@ -777,11 +794,20 @@ void libxl__xc_domain_restore_done(libxl__egc *egc, void *dcs_void,
      * If the stream is not still alive, we must not continue any work.
      */
     if (libxl__stream_read_inuse(stream)) {
-        /*
-         * Libxc has indicated that it is done with the stream.  Resume reading
-         * libxl records from it.
-         */
-        stream_continue(egc, stream);
+        if (checkpointed_stream) {
+            /*
+             * Failover from primary. Domain state is currently at a
+             * consistent checkpoint, complete the stream, and call
+             * stream->completion_callback() to resume the guest.
+             */
+            stream_complete(egc, stream, 0);
+        } else {
+            /*
+             * Libxc has indicated that it is done with the stream.
+             * Resume reading libxl records from it.
+             */
+            stream_continue(egc, stream);
+        }
     }
 }
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 3/6] remus: resume immediately if libxl__xc_domain_save_done() completes
  2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
  2016-01-19  7:17 ` [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state Wen Congyang
  2016-01-19  7:17 ` [PATCH v5 2/6] remus: don't call stream_continue() when doing failover Wen Congyang
@ 2016-01-19  7:17 ` Wen Congyang
  2016-01-19 10:55   ` Ian Campbell
  2016-01-19  7:17 ` [PATCH v5 4/6] tools/libxc: don't send end record if remus fails Wen Congyang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Wen Congyang @ 2016-01-19  7:17 UTC (permalink / raw)
  To: xen devel, Andrew Cooper
  Cc: Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang, Ian Jackson,
	Shriram Rajagopalan, Yang Hongyang

For example: if the secondary host is down, and we fail to send the data to
the secondary host. xc_domain_save() returns 0. So in the function
libxl__xc_domain_save_done(), rc is 0 (the helper program exits normally),
and retval is 0 (it is xc_domain_save()'s return value). In such case, we
just need to complete the stream.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 tools/libxl/libxl.h              |  4 ++++
 tools/libxl/libxl_stream_write.c | 14 ++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 7114491..df6c7a3 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1215,6 +1215,10 @@ int libxl_domain_resume(libxl_ctx *ctx, uint32_t domid, int suspend_cancel,
                         const libxl_asyncop_how *ao_how)
                         LIBXL_EXTERNAL_CALLERS_ONLY;
 
+/*
+ * This function doesn't return until something is wrong, and we need to
+ * do failover from secondary.
+ */
 int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
                              uint32_t domid, int send_fd, int recv_fd,
                              const libxl_asyncop_how *ao_how)
diff --git a/tools/libxl/libxl_stream_write.c b/tools/libxl/libxl_stream_write.c
index 80d9208..21b4b51 100644
--- a/tools/libxl/libxl_stream_write.c
+++ b/tools/libxl/libxl_stream_write.c
@@ -354,8 +354,18 @@ void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
      * alive, and check_all_finished() may have torn it down around us.
      * If the stream is not still alive, we must not continue any work.
      */
-    if (libxl__stream_write_inuse(stream))
-        write_emulator_xenstore_record(egc, stream);
+    if (libxl__stream_write_inuse(stream)) {
+        if (dss->remus)
+            /*
+             * For remus, if libxl__xc_domain_save_done() completes,
+             * there was an error sending data to the secondary.
+             * Resume the primary ASAP. The caller doesn't care of the
+             * return value (Please refer to libxl__remus_teardown())
+             */
+            stream_complete(egc, stream, 0);
+        else
+            write_emulator_xenstore_record(egc, stream);
+    }
 }
 
 static void write_emulator_xenstore_record(libxl__egc *egc,
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 4/6] tools/libxc: don't send end record if remus fails
  2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
                   ` (2 preceding siblings ...)
  2016-01-19  7:17 ` [PATCH v5 3/6] remus: resume immediately if libxl__xc_domain_save_done() completes Wen Congyang
@ 2016-01-19  7:17 ` Wen Congyang
  2016-01-19  7:17 ` [PATCH v5 5/6] tools/libxc: error handling for the postcopy() callback Wen Congyang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Wen Congyang @ 2016-01-19  7:17 UTC (permalink / raw)
  To: xen devel, Andrew Cooper
  Cc: Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang, Ian Jackson,
	Shriram Rajagopalan, Yang Hongyang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxc/xc_sr_save.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index 88d85ef..e532168 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -795,7 +795,7 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
 
             rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
             if ( rc <= 0 )
-                ctx->save.checkpointed = false;
+                goto err;
         }
     } while ( ctx->save.checkpointed );
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 5/6] tools/libxc: error handling for the postcopy() callback
  2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
                   ` (3 preceding siblings ...)
  2016-01-19  7:17 ` [PATCH v5 4/6] tools/libxc: don't send end record if remus fails Wen Congyang
@ 2016-01-19  7:17 ` Wen Congyang
  2016-01-19  7:17 ` [PATCH v5 6/6] tools/libxl: remove unused function libxl__domain_save_device_model() Wen Congyang
  2016-01-20 17:40 ` [PATCH v5 0/6] migration/remus: bug fix and cleanup Ian Campbell
  6 siblings, 0 replies; 14+ messages in thread
From: Wen Congyang @ 2016-01-19  7:17 UTC (permalink / raw)
  To: xen devel, Andrew Cooper
  Cc: Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang, Ian Jackson,
	Shriram Rajagopalan, Yang Hongyang

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxc/xc_sr_save.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
index e532168..e4ba560 100644
--- a/tools/libxc/xc_sr_save.c
+++ b/tools/libxc/xc_sr_save.c
@@ -791,7 +791,9 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
             if ( rc )
                 goto err;
 
-            ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            rc = ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+            if ( rc <= 0 )
+                goto err;
 
             rc = ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
             if ( rc <= 0 )
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 6/6] tools/libxl: remove unused function libxl__domain_save_device_model()
  2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
                   ` (4 preceding siblings ...)
  2016-01-19  7:17 ` [PATCH v5 5/6] tools/libxc: error handling for the postcopy() callback Wen Congyang
@ 2016-01-19  7:17 ` Wen Congyang
  2016-01-20 17:40 ` [PATCH v5 0/6] migration/remus: bug fix and cleanup Ian Campbell
  6 siblings, 0 replies; 14+ messages in thread
From: Wen Congyang @ 2016-01-19  7:17 UTC (permalink / raw)
  To: xen devel, Andrew Cooper
  Cc: Changlong Xie, Wei Liu, Ian Campbell, Wen Congyang, Ian Jackson,
	Shriram Rajagopalan, Yang Hongyang

After the commit d77570e7, libxl__domain_save_device_model() is
completely unused and can be dropped.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxl/libxl.c          |  4 --
 tools/libxl/libxl_dom.c      | 91 --------------------------------------------
 tools/libxl/libxl_internal.h |  6 ---
 3 files changed, 101 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index a8718b0..901e8eb 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -1550,10 +1550,6 @@ static void stubdom_destroy_callback(libxl__egc *egc,
     dds->stubdom_finished = 1;
     savefile = libxl__device_model_savefile(gc, dis->domid);
     rc = libxl__remove_file(gc, savefile);
-    /*
-     * On suspend libxl__domain_save_device_model will have already
-     * unlinked the save file.
-     */
     if (rc) {
         LOG(ERROR, "failed to remove device-model savefile %s", savefile);
     }
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 47971a9..2269998 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -1785,97 +1785,6 @@ static void stream_done(libxl__egc *egc,
     domain_save_done(egc, sws->dss, rc);
 }
 
-static void save_device_model_datacopier_done(libxl__egc *egc,
-     libxl__datacopier_state *dc, int rc, int onwrite, int errnoval);
-
-void libxl__domain_save_device_model(libxl__egc *egc,
-                                     libxl__domain_suspend_state *dss,
-                                     libxl__save_device_model_cb *callback)
-{
-    STATE_AO_GC(dss->ao);
-    struct stat st;
-    uint32_t qemu_state_len;
-    int rc;
-
-    dss->save_dm_callback = callback;
-
-    /* Convenience aliases */
-    const char *const filename = dss->dm_savefile;
-    const int fd = dss->fd;
-
-    libxl__datacopier_state *dc = &dss->save_dm_datacopier;
-    memset(dc, 0, sizeof(*dc));
-    dc->readwhat = GCSPRINTF("qemu save file %s", filename);
-    dc->ao = ao;
-    dc->readfd = -1;
-    dc->writefd = fd;
-    dc->maxsz = INT_MAX;
-    dc->bytes_to_read = -1;
-    dc->copywhat = GCSPRINTF("qemu save file for domain %"PRIu32, dss->domid);
-    dc->writewhat = "save/migration stream";
-    dc->callback = save_device_model_datacopier_done;
-
-    dc->readfd = open(filename, O_RDONLY);
-    if (dc->readfd < 0) {
-        LOGE(ERROR, "unable to open %s", dc->readwhat);
-        rc = ERROR_FAIL;
-        goto out;
-    }
-
-    if (fstat(dc->readfd, &st))
-    {
-        LOGE(ERROR, "unable to fstat %s", dc->readwhat);
-        rc = ERROR_FAIL;
-        goto out;
-    }
-
-    if (!S_ISREG(st.st_mode)) {
-        LOG(ERROR, "%s is not a plain file!", dc->readwhat);
-        rc = ERROR_FAIL;
-        goto out;
-    }
-
-    qemu_state_len = st.st_size;
-    LOG(DEBUG, "%s is %d bytes", dc->readwhat, qemu_state_len);
-
-    rc = libxl__datacopier_start(dc);
-    if (rc) goto out;
-
-    libxl__datacopier_prefixdata(egc, dc,
-                                 QEMU_SIGNATURE, strlen(QEMU_SIGNATURE));
-
-    libxl__datacopier_prefixdata(egc, dc,
-                                 &qemu_state_len, sizeof(qemu_state_len));
-    return;
-
- out:
-    save_device_model_datacopier_done(egc, dc, rc, -1, EIO);
-}
-
-static void save_device_model_datacopier_done(libxl__egc *egc,
-     libxl__datacopier_state *dc, int our_rc, int onwrite, int errnoval)
-{
-    libxl__domain_suspend_state *dss =
-        CONTAINER_OF(dc, *dss, save_dm_datacopier);
-    STATE_AO_GC(dss->ao);
-
-    /* Convenience aliases */
-    const char *const filename = dss->dm_savefile;
-    int rc;
-
-    libxl__datacopier_kill(dc);
-
-    if (dc->readfd >= 0) {
-        close(dc->readfd);
-        dc->readfd = -1;
-    }
-
-    rc = libxl__remove_file(gc, filename);
-    if (!our_rc) our_rc = rc;
-
-    dss->save_dm_callback(egc, dss, our_rc);
-}
-
 static void libxl__remus_teardown(libxl__egc *egc,
                                   libxl__domain_suspend_state *dss,
                                   int rc);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index a556a38..233d44a 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3103,9 +3103,6 @@ struct libxl__domain_suspend_state {
     libxl__logdirty_switch logdirty;
     void (*callback_common_done)(libxl__egc*,
                                  struct libxl__domain_suspend_state*, int ok);
-    /* private for libxl__domain_save_device_model */
-    libxl__save_device_model_cb *save_dm_callback;
-    libxl__datacopier_state save_dm_datacopier;
 };
 
 
@@ -3498,9 +3495,6 @@ static inline bool libxl__save_helper_inuse(const libxl__save_helper_state *shs)
 /* Each time the dm needs to be saved, we must call suspend and then save */
 _hidden int libxl__domain_suspend_device_model(libxl__gc *gc,
                                            libxl__domain_suspend_state *dss);
-_hidden void libxl__domain_save_device_model(libxl__egc *egc,
-                                     libxl__domain_suspend_state *dss,
-                                     libxl__save_device_model_cb *callback);
 
 _hidden const char *libxl__device_model_savefile(libxl__gc *gc, uint32_t domid);
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state
  2016-01-19  7:17 ` [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state Wen Congyang
@ 2016-01-19 10:48   ` Ian Campbell
  2016-01-19 15:35     ` Andrew Cooper
  2016-01-20  0:42     ` Wen Congyang
  0 siblings, 2 replies; 14+ messages in thread
From: Ian Campbell @ 2016-01-19 10:48 UTC (permalink / raw)
  To: Wen Congyang, xen devel, Andrew Cooper
  Cc: Shriram Rajagopalan, Wei Liu, Changlong Xie, Ian Jackson,
	Yang Hongyang

On Tue, 2016-01-19 at 15:17 +0800, Wen Congyang wrote:
> We will have an consistent state when a CHECKPOINT_END record

"a consistent ..." (and in the subject too).

> is received. After the first CHECKPOINT_END record is received,
> we will buffer all records until the next CHECKPOINT_END record
> is received. So if the checkpoint() callback returns
> XGR_CHECKPOINT_FAILOVER,
> we only can do failover if ctx->restore.buffer_all_records is
> true.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---

Please can you get into the habit of writing a delta from the previous
version here. e.g. in this case:

v5: New patch.

Putting it after the --- means it doesn't go into the actual commit ("git
am" will strip it) but it is very useful for reviewers to know what changed
in each iteration.

See also http://wiki.xen.org/wiki/Submitting_Xen_Patches#Review.2C_Rinse_.26_Repeat

>  tools/libxc/xc_sr_restore.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
> index 05159bb..9fe2829 100644
> --- a/tools/libxc/xc_sr_restore.c
> +++ b/tools/libxc/xc_sr_restore.c
> @@ -493,7 +493,11 @@ static int handle_checkpoint(struct xc_sr_context
> *ctx)
>          break;
>  
>      case XGR_CHECKPOINT_FAILOVER:
> -        rc = BROKEN_CHANNEL;
> +        if ( ctx->restore.buffer_all_records )
> +            rc = BROKEN_CHANNEL;
> +        else
> +            /* We don't have an consistent state */

"a" not "an" again.

I can s/an/a/ in all 3 places upon commit, so no need to resend for just
those.

Acked-by: Ian Campbell <ian.campbell@citrix.com>

I'll give Andy a chance to comment before committing though.

> +            rc = -1;
>          goto err;
>  
>      default: /* Other fatal error */

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 2/6] remus: don't call stream_continue() when doing failover
  2016-01-19  7:17 ` [PATCH v5 2/6] remus: don't call stream_continue() when doing failover Wen Congyang
@ 2016-01-19 10:50   ` Ian Campbell
  0 siblings, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2016-01-19 10:50 UTC (permalink / raw)
  To: Wen Congyang, xen devel, Andrew Cooper
  Cc: Shriram Rajagopalan, Wei Liu, Changlong Xie, Ian Jackson,
	Yang Hongyang

On Tue, 2016-01-19 at 15:17 +0800, Wen Congyang wrote:
> stream_continue() is used for migration to read emulator
> xenstore data and emulator context. For remus, if we do
> failover, we have read it in the checkpoint cycle, and
> we only need to complete the stream.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/6] remus: resume immediately if libxl__xc_domain_save_done() completes
  2016-01-19  7:17 ` [PATCH v5 3/6] remus: resume immediately if libxl__xc_domain_save_done() completes Wen Congyang
@ 2016-01-19 10:55   ` Ian Campbell
  2016-01-20  0:41     ` Wen Congyang
  0 siblings, 1 reply; 14+ messages in thread
From: Ian Campbell @ 2016-01-19 10:55 UTC (permalink / raw)
  To: Wen Congyang, xen devel, Andrew Cooper
  Cc: Shriram Rajagopalan, Wei Liu, Changlong Xie, Ian Jackson,
	Yang Hongyang

On Tue, 2016-01-19 at 15:17 +0800, Wen Congyang wrote:
> For example: if the secondary host is down, and we fail to send the data
> to
> the secondary host. xc_domain_save() returns 0. So in the function
> libxl__xc_domain_save_done(), rc is 0 (the helper program exits
> normally),
> and retval is 0 (it is xc_domain_save()'s return value). In such case, we
> just need to complete the stream.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  tools/libxl/libxl.h              |  4 ++++
>  tools/libxl/libxl_stream_write.c | 14 ++++++++++++--
>  2 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> index 7114491..df6c7a3 100644
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -1215,6 +1215,10 @@ int libxl_domain_resume(libxl_ctx *ctx, uint32_t
> domid, int suspend_cancel,
>                          const libxl_asyncop_how *ao_how)
>                          LIBXL_EXTERNAL_CALLERS_ONLY;
>  
> +/*
> + * This function doesn't return until something is wrong, and we need to
> + * do failover from secondary.

This function runs on the primary, doesn't it? and failover would be from
primary to secondary.

So I think a more accurate wording would be:

/*
 * This function doesn't return unless something has gone wrong with the
 * replication to the secondary. If this function returns then the caller 
 * should resume the (primary) domain.
 */

I'm happy to edit the text on commit if you agree with the proposed
wording. The code looks good.

Thanks,
Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state
  2016-01-19 10:48   ` Ian Campbell
@ 2016-01-19 15:35     ` Andrew Cooper
  2016-01-20  0:42     ` Wen Congyang
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Cooper @ 2016-01-19 15:35 UTC (permalink / raw)
  To: Ian Campbell, Wen Congyang, xen devel
  Cc: Shriram Rajagopalan, Wei Liu, Changlong Xie, Ian Jackson,
	Yang Hongyang

On 19/01/16 10:48, Ian Campbell wrote:
> On Tue, 2016-01-19 at 15:17 +0800, Wen Congyang wrote:
>> We will have an consistent state when a CHECKPOINT_END record
> "a consistent ..." (and in the subject too).
>
>> is received. After the first CHECKPOINT_END record is received,
>> we will buffer all records until the next CHECKPOINT_END record
>> is received. So if the checkpoint() callback returns
>> XGR_CHECKPOINT_FAILOVER,
>> we only can do failover if ctx->restore.buffer_all_records is
>> true.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
> Please can you get into the habit of writing a delta from the previous
> version here. e.g. in this case:
>
> v5: New patch.
>
> Putting it after the --- means it doesn't go into the actual commit ("git
> am" will strip it) but it is very useful for reviewers to know what changed
> in each iteration.
>
> See also http://wiki.xen.org/wiki/Submitting_Xen_Patches#Review.2C_Rinse_.26_Repeat
>
>>  tools/libxc/xc_sr_restore.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
>> index 05159bb..9fe2829 100644
>> --- a/tools/libxc/xc_sr_restore.c
>> +++ b/tools/libxc/xc_sr_restore.c
>> @@ -493,7 +493,11 @@ static int handle_checkpoint(struct xc_sr_context
>> *ctx)
>>          break;
>>  
>>      case XGR_CHECKPOINT_FAILOVER:
>> -        rc = BROKEN_CHANNEL;
>> +        if ( ctx->restore.buffer_all_records )
>> +            rc = BROKEN_CHANNEL;
>> +        else
>> +            /* We don't have an consistent state */
> "a" not "an" again.
>
> I can s/an/a/ in all 3 places upon commit, so no need to resend for just
> those.
>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>, with suggested
corrections

>
> I'll give Andy a chance to comment before committing though.
>
>> +            rc = -1;
>>          goto err;
>>  
>>      default: /* Other fatal error */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/6] remus: resume immediately if libxl__xc_domain_save_done() completes
  2016-01-19 10:55   ` Ian Campbell
@ 2016-01-20  0:41     ` Wen Congyang
  0 siblings, 0 replies; 14+ messages in thread
From: Wen Congyang @ 2016-01-20  0:41 UTC (permalink / raw)
  To: Ian Campbell, xen devel, Andrew Cooper
  Cc: Shriram Rajagopalan, Wei Liu, Changlong Xie, Ian Jackson,
	Yang Hongyang

On 01/19/2016 06:55 PM, Ian Campbell wrote:
> On Tue, 2016-01-19 at 15:17 +0800, Wen Congyang wrote:
>> For example: if the secondary host is down, and we fail to send the data
>> to
>> the secondary host. xc_domain_save() returns 0. So in the function
>> libxl__xc_domain_save_done(), rc is 0 (the helper program exits
>> normally),
>> and retval is 0 (it is xc_domain_save()'s return value). In such case, we
>> just need to complete the stream.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>>  tools/libxl/libxl.h              |  4 ++++
>>  tools/libxl/libxl_stream_write.c | 14 ++++++++++++--
>>  2 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
>> index 7114491..df6c7a3 100644
>> --- a/tools/libxl/libxl.h
>> +++ b/tools/libxl/libxl.h
>> @@ -1215,6 +1215,10 @@ int libxl_domain_resume(libxl_ctx *ctx, uint32_t
>> domid, int suspend_cancel,
>>                          const libxl_asyncop_how *ao_how)
>>                          LIBXL_EXTERNAL_CALLERS_ONLY;
>>  
>> +/*
>> + * This function doesn't return until something is wrong, and we need to
>> + * do failover from secondary.
> 
> This function runs on the primary, doesn't it? and failover would be from
> primary to secondary.

Yes, it runs on the primary

> 
> So I think a more accurate wording would be:
> 
> /*
>  * This function doesn't return unless something has gone wrong with the
>  * replication to the secondary. If this function returns then the caller 
>  * should resume the (primary) domain.
>  */
> 
> I'm happy to edit the text on commit if you agree with the proposed
> wording. The code looks good.

I agree with that.

Thanks
Wen Congyang

> 
> Thanks,
> Ian.
> 
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state
  2016-01-19 10:48   ` Ian Campbell
  2016-01-19 15:35     ` Andrew Cooper
@ 2016-01-20  0:42     ` Wen Congyang
  1 sibling, 0 replies; 14+ messages in thread
From: Wen Congyang @ 2016-01-20  0:42 UTC (permalink / raw)
  To: Ian Campbell, xen devel, Andrew Cooper
  Cc: Shriram Rajagopalan, Wei Liu, Changlong Xie, Ian Jackson,
	Yang Hongyang

On 01/19/2016 06:48 PM, Ian Campbell wrote:
> On Tue, 2016-01-19 at 15:17 +0800, Wen Congyang wrote:
>> We will have an consistent state when a CHECKPOINT_END record
> 
> "a consistent ..." (and in the subject too).
> 
>> is received. After the first CHECKPOINT_END record is received,
>> we will buffer all records until the next CHECKPOINT_END record
>> is received. So if the checkpoint() callback returns
>> XGR_CHECKPOINT_FAILOVER,
>> we only can do failover if ctx->restore.buffer_all_records is
>> true.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
> 
> Please can you get into the habit of writing a delta from the previous
> version here. e.g. in this case:
> 
> v5: New patch.
> 
> Putting it after the --- means it doesn't go into the actual commit ("git
> am" will strip it) but it is very useful for reviewers to know what changed
> in each iteration.
> 
> See also http://wiki.xen.org/wiki/Submitting_Xen_Patches#Review.2C_Rinse_.26_Repeat
> 
>>  tools/libxc/xc_sr_restore.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
>> index 05159bb..9fe2829 100644
>> --- a/tools/libxc/xc_sr_restore.c
>> +++ b/tools/libxc/xc_sr_restore.c
>> @@ -493,7 +493,11 @@ static int handle_checkpoint(struct xc_sr_context
>> *ctx)
>>          break;
>>  
>>      case XGR_CHECKPOINT_FAILOVER:
>> -        rc = BROKEN_CHANNEL;
>> +        if ( ctx->restore.buffer_all_records )
>> +            rc = BROKEN_CHANNEL;
>> +        else
>> +            /* We don't have an consistent state */
> 
> "a" not "an" again.
> 
> I can s/an/a/ in all 3 places upon commit, so no need to resend for just
> those.

OK, thanks for your help

Wen Congyang

> 
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> I'll give Andy a chance to comment before committing though.
> 
>> +            rc = -1;
>>          goto err;
>>  
>>      default: /* Other fatal error */
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 0/6] migration/remus: bug fix and cleanup
  2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
                   ` (5 preceding siblings ...)
  2016-01-19  7:17 ` [PATCH v5 6/6] tools/libxl: remove unused function libxl__domain_save_device_model() Wen Congyang
@ 2016-01-20 17:40 ` Ian Campbell
  6 siblings, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2016-01-20 17:40 UTC (permalink / raw)
  To: Wen Congyang, xen devel, Andrew Cooper
  Cc: Shriram Rajagopalan, Wei Liu, Changlong Xie, Ian Jackson,
	Yang Hongyang

On Tue, 2016-01-19 at 15:17 +0800, Wen Congyang wrote:
> Wen Congyang (6):
>   remus: don't do failover if we don't have an consistent state
>   remus: don't call stream_continue() when doing failover
>   remus: resume immediately if libxl__xc_domain_save_done() completes
>   tools/libxc: don't send end record if remus fails
>   tools/libxc: error handling for the postcopy() callback
>   tools/libxl: remove unused function libxl__domain_save_device_model()

I've now pushed all of these with the minor changes discussed, thanks,

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-01-20 17:40 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-19  7:17 [PATCH v5 0/6] migration/remus: bug fix and cleanup Wen Congyang
2016-01-19  7:17 ` [PATCH v5 1/6] remus: don't do failover if we don't have an consistent state Wen Congyang
2016-01-19 10:48   ` Ian Campbell
2016-01-19 15:35     ` Andrew Cooper
2016-01-20  0:42     ` Wen Congyang
2016-01-19  7:17 ` [PATCH v5 2/6] remus: don't call stream_continue() when doing failover Wen Congyang
2016-01-19 10:50   ` Ian Campbell
2016-01-19  7:17 ` [PATCH v5 3/6] remus: resume immediately if libxl__xc_domain_save_done() completes Wen Congyang
2016-01-19 10:55   ` Ian Campbell
2016-01-20  0:41     ` Wen Congyang
2016-01-19  7:17 ` [PATCH v5 4/6] tools/libxc: don't send end record if remus fails Wen Congyang
2016-01-19  7:17 ` [PATCH v5 5/6] tools/libxc: error handling for the postcopy() callback Wen Congyang
2016-01-19  7:17 ` [PATCH v5 6/6] tools/libxl: remove unused function libxl__domain_save_device_model() Wen Congyang
2016-01-20 17:40 ` [PATCH v5 0/6] migration/remus: bug fix and cleanup Ian Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).