From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Cooper Subject: Re: [RFC PATCH 2/3] remus: implement remus checkpoint in v2 save Date: Wed, 16 Jul 2014 16:38:56 +0100 Message-ID: <53C69C90.1040908@citrix.com> References: <1404892050-24650-1-git-send-email-yanghy@cn.fujitsu.com> <1404892050-24650-3-git-send-email-yanghy@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============6499459271093713481==" Return-path: In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: rshriram@cs.ubc.ca, Yang Hongyang Cc: Ian Jackson , Ian Campbell , "xen-devel@lists.xen.org" List-Id: xen-devel@lists.xenproject.org --===============6499459271093713481== Content-Type: multipart/alternative; boundary="------------000403020503080604080509" --------------000403020503080604080509 Content-Type: text/plain; charset="UTF-8" Content-Length: 8589 Content-Transfer-Encoding: quoted-printable On 16/07/14 16:22, Shriram Rajagopalan wrote: > > > > On Wed, Jul 9, 2014 at 3:47 AM, Yang Hongyang > wrote: > > implement remus checkpoint in v2 save > > Signed-off-by: Yang Hongyang > > --- > tools/libxc/saverestore/common.h | 1 + > tools/libxc/saverestore/save.c | 88 > ++++++++++++++++++++++++---------------- > 2 files changed, 55 insertions(+), 34 deletions(-) > > diff --git a/tools/libxc/saverestore/common.h > b/tools/libxc/saverestore/common.h > index 24ba95b..1dd9f51 100644 > --- a/tools/libxc/saverestore/common.h > +++ b/tools/libxc/saverestore/common.h > @@ -153,6 +153,7 @@ struct xc_sr_context > > xc_dominfo_t dominfo; > bool checkpointed; > + bool firsttime; > > union > { > diff --git a/tools/libxc/saverestore/save.c > b/tools/libxc/saverestore/save.c > index d2fa8a6..98a5c2f 100644 > --- a/tools/libxc/saverestore/save.c > +++ b/tools/libxc/saverestore/save.c > @@ -375,6 +375,8 @@ static int send_domain_memory_live(struct > xc_sr_context *ctx) > goto out; > } > > + if ( ctx->checkpointed && !ctx->firsttime ) > + goto lastiter; > > > nit: last_iter > > I suggest adding a comment before this code block to explain what has > happened so > far above why we are jumping to the last_iter block skipping the rest > of the stuff below. > > I also suggest maintaining some stats structure somewhere (number of > dirty pages, > time when checkpoint was initiated, etc.). They could be simply debug > prints that > can be enabled on demand. > > A better alternative would be to somehow pass this stats structure > back to libxl, > such that the user can enable/disable stats printing using the xl > remus command > for example. This is partly my fault. I so far haven=E2=80=99t got any of the "progress report" set of logging functionality wired up. Once that is done, libxl should be fine, as it can already filter the progress logging from the regular logging. > > > /* This juggling is required if logdirty is already on, e.g. > VRAM tracking */ > if ( xc_shadow_control(xch, ctx->domid, > XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, > @@ -436,6 +438,7 @@ static int send_domain_memory_live(struct > xc_sr_context *ctx) > break; > } > > +lastiter: > rc =3D suspend_domain(ctx); > if ( rc ) > goto out; > > > I hope the postsuspend callbacks are called somewhere=3F > > > @@ -570,44 +573,60 @@ static int save(struct xc_sr_context *ctx, > uint16_t guest_type) > if ( rc ) > goto err; > > - rc =3D ctx->save.ops.start_of_stream(ctx); > - if ( rc ) > - goto err; > + do { > + rc =3D ctx->save.ops.start_of_stream(ctx); > + if ( rc ) > + goto err; > > - if ( ctx->save.live ) > - { > - DPRINTF("Starting live migrate"); > - rc =3D send_domain_memory_live(ctx); > - } > - else > - { > - DPRINTF("Starting nonlive save"); > - rc =3D send_domain_memory_nonlive(ctx); > - } > + if ( ctx->save.live ) > + { > + DPRINTF("Starting live migrate"); > > > I suggest using the ctx->firsttime bool var before this printf, so that > when debug print is enabled under remus, the console is not > flooded with the same statement that makes no sense past the > first iteration. A thought has just crossed my mind. It might be better to have a "send_domain_remus()" function which internally deals with the iterations and looping etc. e.g. if ( remus ) rc =3D send_domain_remus(ctx); else if ( ctx->save.live ) rc =3D send_domain_live(ctx); else rc =3D send_domain_memory_nonlive(ctx); In my upcoming series which fixes the deferred vcpu state problem I have split some of the common bits out of send_domain_memory_{live,nonline}() which might be useful. > > > + rc =3D send_domain_memory_live(ctx); > + } > + else > + { > + DPRINTF("Starting nonlive save"); > + rc =3D send_domain_memory_nonlive(ctx); > + } > > - if ( rc ) > - goto err; > + if ( rc ) > + goto err; > > - /* Refresh domain information now it has paused. */ > - if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) !=3D > 1) || > - (ctx->dominfo.domid !=3D ctx->domid) ) > - { > - PERROR("Unable to refresh domain information"); > - rc =3D -1; > - goto err; > - } > - else if ( (!ctx->dominfo.shutdown || > - ctx->dominfo.shutdown_reason !=3D SHUTDOWN_suspend ) && > - !ctx->dominfo.paused ) > - { > - ERROR("Domain has not been suspended"); > - rc =3D -1; > - goto err; > - } > + /* Refresh domain information now it has paused. */ > + if ( (xc_domain_getinfo(xch, ctx->domid, 1, > &ctx->dominfo) !=3D 1) || > + (ctx->dominfo.domid !=3D ctx->domid) ) > + { > + PERROR("Unable to refresh domain information"); > + rc =3D -1; > + goto err; > + } > + else if ( (!ctx->dominfo.shutdown || > + ctx->dominfo.shutdown_reason !=3D > SHUTDOWN_suspend ) && > + !ctx->dominfo.paused ) > + { > + ERROR("Domain has not been suspended"); > + rc =3D -1; > + goto err; > + } > > > A silly question: Shouldn't this check for 'suspended status' be > before the call to > send_domain_memory_live() [under remus]. I am assuming that the > send_domain_memory_live() function is the one that sends the dirty > page data > out on the wire. Even for non-live migrates, we have to ensure that the VM has entered the suspend state. A PV guest cannot possibly recover from resume if it didn't voluntarily suspend as it will loose its reference to the start info page (whos mfn lives in vcpu0.edx for the duration of the migrate and must be updated on the receiving side). Any VM which doesn't sufficiently quiesce its fronends risks ring and memory corruption on resume. In this case, the send_domain_memory_XXX functions do take care of pausing the domain at the appropriate time, but this here is a sanity check. ~Andrew > > > > - rc =3D ctx->save.ops.end_of_stream(ctx); > - if ( rc ) > - goto err; > + rc =3D ctx->save.ops.end_of_stream(ctx); > + if ( rc ) > + goto err; > + > + if ( ctx->checkpointed ) { > + if ( ctx->firsttime ) > + ctx->firsttime =3D false; > + > + ctx->save.callbacks->postcopy(ctx->save.callbacks->data); > + > + rc =3D > ctx->save.callbacks->checkpoint(ctx->save.callbacks->data); > + if ( rc > 0 ) { > + IPRINTF("Next checkpoint\n"); > > > I suggest maintaining checkpoint numbers instead. Much more helpful. > Probably even add > a gettimeofday call to print out the time the new checkpoint is > started. Once again, its useful > to be able to control this printout from libxl > > + } else { > + ctx->checkpointed =3D false; > + } > + } > + } while ( ctx->checkpointed ); > > rc =3D write_end_record(ctx); > if ( rc ) > @@ -653,6 +672,7 @@ int xc_domain_save2(xc_interface *xch, int > io_fd, uint32_t dom, uint32_t max_ite > ctx.save.live =3D !!(flags & XCFLAGS_LIVE); > ctx.save.debug =3D !!(flags & XCFLAGS_DEBUG); > ctx.checkpointed =3D !!(flags & XCFLAGS_CHECKPOINTED); > + ctx.firsttime =3D true; > > if ( ctx.checkpointed ) { > /* This is a checkpointed save, we need these callbacks */ > -- > 1.9.1 > > --------------000403020503080604080509 Content-Type: text/html; charset="UTF-8" Content-Length: 17285 Content-Transfer-Encoding: quoted-printable
On 16/07/14 16:22, Shriram Rajagopalan wrote:



On Wed, Jul 9, 2014 at 3:47 AM, Yang Hongyang <yanghy@cn.fujitsu.com> wrote:
implement remus checkpoint in v2 save

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
=C2=A0tools/libxc/saverestore/common.h | =C2=A01 +
=C2=A0tools/libxc/saverestore/save.c =C2=A0 | 88 ++++++++++++++++++++++++----------------
=C2=A02 files changed, 55 insertions(+), 34 deletions(-)

diff --git a/tools/libxc/saverestore/common.h b/tools/libxc/saverestore/common.h
index 24ba95b..1dd9f51 100644
--- a/tools/libxc/saverestore/common.h
+++ b/tools/libxc/saverestore/common.h
@@ -153,6 +153,7 @@ struct xc_sr_context

=C2=A0 =C2=A0 =C2=A0xc_dominfo_t dominfo;
=C2=A0 =C2=A0 =C2=A0bool checkpointed;
+ =C2=A0 =C2=A0bool firsttime;

=C2=A0 =C2=A0 =C2=A0union
=C2=A0 =C2=A0 =C2=A0{
diff --git a/tools/libxc/saverestore/save.c b/tools/libxc/saverestore/save.c
index d2fa8a6..98a5c2f 100644
--- a/tools/libxc/saverestore/save.c
+++ b/tools/libxc/saverestore/save.c
@@ -375,6 +375,8 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto out;
=C2=A0 =C2=A0 =C2=A0}

+ =C2=A0 =C2=A0if ( ctx->checkpointed && !ctx->firsttime )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0goto lastiter;

nit: last_iter

I suggest adding a comment before this code block to explain what has happened so=C2=A0
far above why we are jumping to the last_iter block skipping the rest of the stuff below.

I also suggest maintaining some stats structure somewhere (number of dirty pages,=C2=A0
time when checkpoint was initiated, etc.). They could be simply debug prints that
can be enabled on demand.

A better alternative would be to somehow pass this stats structure back to libxl,
such that the user can enable/disable stats printing using the xl remus command
for example.

This is partly my fault.=C2=A0 I so far haven=E2=80=99t got any of the "progress report" set of logging functionality wired up.=C2=A0 Once that is done, libxl should be fine, as it can already filter the progress logging from the regular logging.

=C2=A0
=C2=A0 =C2=A0 =C2=A0/* This juggling is required if logdirty is already on, e.g. VRAM tracking */
=C2=A0 =C2=A0 =C2=A0if ( xc_shadow_control(xch, ctx->domid,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
@@ -436,6 +438,7 @@ static int send_domain_memory_live(struct xc_sr_context *ctx)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break;
=C2=A0 =C2=A0 =C2=A0}

+lastiter:
=C2=A0 =C2=A0 =C2=A0rc =3D suspend_domain(ctx);
=C2=A0 =C2=A0 =C2=A0if ( rc )
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto out;

I hope the postsuspend callbacks are called somewhere=3F
=C2=A0
@@ -570,44 +573,60 @@ static int save(struct xc_sr_context *ctx, uint16_t guest_type)
=C2=A0 =C2=A0 =C2=A0if ( rc )
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;

- =C2=A0 =C2=A0rc =3D ctx->save.ops.start_of_stream(ctx);
- =C2=A0 =C2=A0if ( rc )
- =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
+ =C2=A0 =C2=A0do {
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D ctx->save.ops.start_of_stream(ctx);
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( rc )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;

- =C2=A0 =C2=A0if ( ctx->save.live )
- =C2=A0 =C2=A0{
- =C2=A0 =C2=A0 =C2=A0 =C2=A0DPRINTF("Starting live migrate");
- =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D send_domain_memory_live(ctx);
- =C2=A0 =C2=A0}
- =C2=A0 =C2=A0else
- =C2=A0 =C2=A0{
- =C2=A0 =C2=A0 =C2=A0 =C2=A0DPRINTF("Starting nonlive save");
- =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D send_domain_memory_nonlive(ctx);
- =C2=A0 =C2=A0}
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( ctx->save.live )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0{
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0DPRINTF("Starting live migrate");

I suggest using the ctx->firsttime bool var before this printf, so that=C2=A0
when debug print is enabled under remus, the console is not
flooded with the same statement that makes no sense past the
first iteration.

A thought has just crossed my mind.=C2=A0 It might be better to have a "send_domain_remus()" function which internally deals with the iterations and looping etc.

e.g.

if ( remus )
=C2=A0=C2=A0=C2=A0 rc =3D send_domain_remus(ctx);
else if ( ctx->save.live )
=C2=A0=C2=A0=C2=A0 rc =3D send_domain_live(ctx);
else
=C2=A0=C2=A0=C2=A0 rc =3D send_domain_memory_nonlive(ctx);

In my upcoming series which fixes the deferred vcpu state problem I have split some of the common bits out of send_domain_memory_{live,nonline}() which might be useful.

=C2=A0
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D send_domain_memory_live(ctx);
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0}
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0else
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0{
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0DPRINTF("Starting nonlive save");
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D send_domain_memory_nonlive(ctx);
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0}

- =C2=A0 =C2=A0if ( rc )
- =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( rc )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;

- =C2=A0 =C2=A0/* Refresh domain information now it has paused. */
- =C2=A0 =C2=A0if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) !=3D 1) ||
- =C2=A0 =C2=A0 =C2=A0 =C2=A0 (ctx->dominfo.domid !=3D ctx->domid) )
- =C2=A0 =C2=A0{
- =C2=A0 =C2=A0 =C2=A0 =C2=A0PERROR("Unable to refresh domain information");
- =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D -1;
- =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
- =C2=A0 =C2=A0}
- =C2=A0 =C2=A0else if ( (!ctx->dominfo.shutdown ||
- =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ctx->dominfo.shutdown_reason !=3D SHUTDOWN_suspend ) &&
- =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0!ctx->dominfo.paused )
- =C2=A0 =C2=A0{
- =C2=A0 =C2=A0 =C2=A0 =C2=A0ERROR("Domain has not been suspended");
- =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D -1;
- =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
- =C2=A0 =C2=A0}
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0/* Refresh domain information now it has paused. */
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( (xc_domain_getinfo(xch, ctx->domid, 1, &ctx->dominfo) !=3D 1) ||
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 (ctx->dominfo.domid !=3D ctx->domid) )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0{
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0PERROR("Unable to refresh domain information");
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D -1;
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0}
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0else if ( (!ctx->dominfo.shutdown ||
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctx->dominfo.shutdown_reason !=3D SHUTDOWN_suspend ) &&
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0!ctx->dominfo.paused )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0{
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ERROR("Domain has not been suspended");
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D -1;
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0}


A silly question: Shouldn't this check for 'suspended status' be before the call to=C2=A0
send_domain_memory_live() [under remus]. I am assuming that the
send_domain_memory_live() function is the one that sends the dirty page data
out on the wire.

Even for non-live migrates, we have to ensure that the VM has entered the suspend state.=C2=A0 A PV guest cannot possibly recover from resume if it didn't voluntarily suspend as it will loose its reference to the start info page (whos mfn lives in vcpu0.edx for the duration of the migrate and must be updated on the receiving side).=C2=A0 Any VM which doesn't sufficiently quiesce its fronends risks ring and memory corruption on resume.

In this case, the send_domain_memory_XXX functions do take care of pausing the domain at the appropriate time, but this here is a sanity check.

~Andrew


=C2=A0
- =C2=A0 =C2=A0rc =3D ctx->save.ops.end_of_stream(ctx);
- =C2=A0 =C2=A0if ( rc )
- =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D ctx->save.ops.end_of_stream(ctx);
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( rc )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0goto err;
+
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( ctx->checkpointed ) {
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( ctx->firsttime )
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctx->firsttime =3D false;
+
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctx->save.callbacks->postcopy(ctx->save.callbacks->data);
+
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0rc =3D ctx->save.callbacks->checkpoint(ctx->save.callbacks->data);
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if ( rc > 0 ) {
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0IPRINTF("Next checkpoint\n");

I suggest maintaining checkpoint numbers instead. Much more helpful. Probably even add
a gettimeofday call to print out the time the new checkpoint is started. Once again, its useful
to be able to control this printout from libxl

+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0} else {
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctx->checkpointed =3D false;
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}
+ =C2=A0 =C2=A0 =C2=A0 =C2=A0}
+ =C2=A0 =C2=A0} while ( ctx->checkpointed );

=C2=A0 =C2=A0 =C2=A0rc =3D write_end_record(ctx);
=C2=A0 =C2=A0 =C2=A0if ( rc )
@@ -653,6 +672,7 @@ int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_ite
=C2=A0 =C2=A0 =C2=A0ctx.save.live =C2=A0=3D !!(flags & XCFLAGS_LIVE);
=C2=A0 =C2=A0 =C2=A0ctx.save.debug =3D !!(flags & XCFLAGS_DEBUG);
=C2=A0 =C2=A0 =C2=A0ctx.checkpointed =3D !!(flags & XCFLAGS_CHECKPOINTED);
+ =C2=A0 =C2=A0ctx.firsttime =3D true;

=C2=A0 =C2=A0 =C2=A0if ( ctx.checkpointed ) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/* This is a checkpointed save, we need these callbacks */
--
1.9.1



--------------000403020503080604080509-- --===============6499459271093713481== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============6499459271093713481==--