On Tue, Jul 16, 2013 at 6:52 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
Commit 00a4b65f8534c9e6521eab2e6ce796ae36037774 Sep 7 2010
"libxc: provide notification of final checkpoint to restore end"
broke migration from any version of Xen using tools from prior to that commit
Older tools have no idea about an XC_SAVE_ID_LAST_CHECKPOINT, causing newer
tools xc_domain_restore() to start reading the qemu save record, as
ctx->last_checkpoint is 0.
The failure looks like:Changes since v1:
xc: error: Max batch size exceeded (1970103633). Giving up.
where 1970103633 = 0x756d6551 = *(uint32_t*)"Qemu"
Sadly, the simple fix of just setting ctx->last_checkpoint = 1 will cause an
opposite function regresson for anyone using the current behaviour of
save_callbacks->checkpoint().
The only safe fix is to rely on the toolstack to provide this information.
Passing 0 results in unchanged behaviour, while passing nonzero means "the
other end of the migration stream does not know about
XC_SAVE_ID_LAST_CHECKPOINT but is performing a normal migrate"
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
* Rename "last_checkpoint_unaware" to "checkpointed_stream" to be more
generic yet still accurate as to the meaning and fix for the issue
---tools/libxc/xenguest.h | 3 ++-
tools/libxc/xc_domain_restore.c | 3 ++-
tools/libxc/xc_nomigrate.c | 2 +-
tools/libxl/libxl_save_helper.c | 2 +-index 63d36cd..e50b185 100644
tools/xcutils/xc_restore.c | 2 +-
5 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
--- a/tools/libxc/xc_domain_restore.c+ int no_incr_generationid, int checkpointed_stream,
+++ b/tools/libxc/xc_domain_restore.c
@@ -1401,7 +1401,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
domid_t store_domid, unsigned int console_evtchn,
unsigned long *console_mfn, domid_t console_domid,
unsigned int hvm, unsigned int pae, int superpages,
- int no_incr_generationid,
unsigned long *vm_generationid_addr,+ ctx->last_checkpoint = !!checkpointed_stream;
struct restore_callbacks *callbacks)
{
@@ -1472,6 +1472,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
ctx->superpages = superpages;
ctx->hvm = hvm;
shouldnt it be a single unary NOT ?i.e., ctx->last_checkpoint = !checkpointed_stream;
Otherwise, I am a bit confused. Basically, there is a memset(ctx, 0) right before this.And you seem to be passing checkpointed_stream = 0 via both xc_restore andlibxl which basically sets it back to ctx->last_checkpoint = 0.
Also, after getting pages from the last iteration, the code goes like this:if (ctx->last_checkpoint)goto finish;get pagebuf or goto finish on errorget tailbuf or goto finish on errorgoto loadpages;
finish:...
and you setting ctx->last_checkpoint = 0 basically means that you arebanking on the far end (with older version of tools) to close the socket, causing"get pagebuf" to fail and subsequently land on finish.IIRC, this was the behavior before XC_SAVE_ID_LAST_CHECKPOINT was introducedby Ian Campbell, to get rid of this benign error message.
"This allows the restore end to finish up without waiting for aspurious timeout on the receive fd and thereby avoids unnecessaryerror logging in the case of a successful migration or restore."
Further,"In the normal migration or restore case the first checkpoint is alwaysthe last. For a rolling checkpoint (such as Remus) the notification iscurrently unused "