From mboxrd@z Thu Jan 1 00:00:00 1970 From: Shriram Rajagopalan Subject: Re: [PATCH v5 00/21] libxl: domain save/restore: run in a separate process Date: Wed, 27 Jun 2012 12:42:41 -0400 Message-ID: References: <1340733318-21099-1-git-send-email-ian.jackson@eu.citrix.com> <20457.63648.864838.199205@mariner.uk.xensource.com> <20459.3767.440030.931642@mariner.uk.xensource.com> <20459.11730.332905.175948@mariner.uk.xensource.com> Reply-To: rshriram@cs.ubc.ca Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============8834480936510365647==" Return-path: In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Jackson Cc: "xen-devel@lists.xen.org" List-Id: xen-devel@lists.xenproject.org --===============8834480936510365647== Content-Type: multipart/alternative; boundary=000e0ce0d65eb6a3a704c376e9b8 --000e0ce0d65eb6a3a704c376e9b8 Content-Type: text/plain; charset=ISO-8859-1 On Wed, Jun 27, 2012 at 12:09 PM, Shriram Rajagopalan wrote: > On Wed, Jun 27, 2012 at 11:59 AM, Ian Jackson wrote: > >> Ian Jackson writes ("Re: [PATCH v5 00/21] libxl: domain save/restore: run >> in a separate process"): >> > However, when I apply my series, I can indeed produce an assertion >> > failure: >> ... >> > So I have indeed made matters worse. >> >> I found two bugs: >> >> 1. The void* passed to the callback was being treated as a >> libxl__domain_suspend_state* by the remus callbacks; this is a >> holdover from a much earlier version of the series. It should be >> converted to a libxl__save_helper_state and then the dss extracted >> with CONTAINER_OF. >> >> 2. The way remus works means that the toolstack save callback is >> invoked more than once, which the helper's implementation was not >> prepared to deal with. Fix this by moving the rewind of the fd into >> the helper. >> >> Fixes for these are below. With this, on top of my series, seem to I >> get the same behaviour as with the baseline. Would you like to try it ? >> >> > Sure, I ll give it a shot. > Btw, my earlier mail was in response to remus not > working on the baseline setup on your dev environment. > > The fix works for 2 out of 3 cases blackhole replication (xl remus -b) localhost replication with failover i.e. destroy primary (xl remus domU localhost) However, it crashes the guest for localhost replication, when I destroy the backup i.e. xl destroy domU--incoming . The primary guest would generally resume, but in this case its in --sc- state. NB: This seems to happen in baseline xen-unstable too!. xc: error: unexpected PFN mapping failure pfn 180e map_mfn 43b808 p2m_mfn 43b808: Internal error libxl: error: libxl_create.c:760:libxl__xc_domain_restore_done: restoring domain: Resource temporarily unavailable libxl: error: libxl_create.c:844:domcreate_rebuild_done: cannot (re-)build domain: -3 libxl: error: libxl.c:1220:libxl_domain_destroy: non-existant domain 17 libxl: error: libxl_create.c:995:domcreate_complete: unable to destroy domain 17 following failed creation migration target: Domain creation failed (code -3). .. Total Data Sent= 12.597 MB libxl: debug: libxl_dom.c:801:libxl__domain_suspend_common_callback: issuing PV suspend request via XenBus control node libxl: debug: libxl_dom.c:805:libxl__domain_suspend_common_callback: wait for the guest to acknowledge suspend request libxl: debug: libxl_dom.c:852:libxl__domain_suspend_common_callback: guest acknowledged suspend request libxl: debug: libxl_dom.c:856:libxl__domain_suspend_common_callback: wait for the guest to suspend libxl: debug: libxl_dom.c:870:libxl__domain_suspend_common_callback: guest has suspended pagetables=2,cache_misses=0,emptypages=45 libxl: error: libxl_utils.c:363:libxl_read_exactly: file/stream truncated reading ipc msg header from domain 16 save/restore helper stdout pipe libxl: error: libxl_exec.c:129:libxl_report_child_exitstatus: domain 16 save/restore helper [3148] died due to fatal signal Broken pipe libxl: debug: libxl_event.c:1434:libxl__ao_complete: ao 0x1b08c80: complete, rc=-3 libxl: debug: libxl_event.c:1406:libxl__ao__destroy: ao 0x1b08c80: destroy remus sender: libxl_domain_suspend failed (rc=-3) Remus: Backup failed? resuming domain at primary. xc: debug: hypercall buffer: total allocations:2116 total releases:2116 xc: debug: hypercall buffer: current allocations:0 maximum allocations:2 xc: debug: hypercall buffer: cache current size:2 xc: debug: hypercall buffer: cache hits:1729 misses:2 toobig:385 > > >> Thanks, >> Ian. >> >> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c >> index abc5932..069aca1 100644 >> --- a/tools/libxl/libxl_dom.c >> +++ b/tools/libxl/libxl_dom.c >> @@ -984,7 +984,8 @@ static int libxl__remus_domain_suspend_callback(void >> *data) >> >> static int libxl__remus_domain_resume_callback(void *data) >> { >> - libxl__domain_suspend_state *dss = data; >> + libxl__save_helper_state *shs = data; >> + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); >> STATE_AO_GC(dss->ao); >> >> /* Resumes the domain and the device model */ >> @@ -1002,7 +1003,8 @@ static void remus_checkpoint_dm_saved(libxl__egc >> *egc, >> >> static void libxl__remus_domain_checkpoint_callback(void *data) >> { >> - libxl__domain_suspend_state *dss = data; >> + libxl__save_helper_state *shs = data; >> + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); >> libxl__egc *egc = dss->shs.egc; >> STATE_AO_GC(dss->ao); >> >> diff --git a/tools/libxl/libxl_save_callout.c >> b/tools/libxl/libxl_save_callout.c >> index 6332beb..078b7ee 100644 >> --- a/tools/libxl/libxl_save_callout.c >> +++ b/tools/libxl/libxl_save_callout.c >> @@ -105,13 +105,6 @@ void libxl__xc_domain_save(libxl__egc *egc, >> libxl__domain_suspend_state *dss, >> toolstack_data_buf, toolstack_data_len, >> "toolstack data tmpfile", 0); >> if (r) { rc = ERROR_FAIL; goto out; } >> - >> - r = lseek(toolstack_data_fd, 0, SEEK_SET); >> - if (r) { >> - LOGE(ERROR, "rewind toolstack data tmpfile"); >> - rc = ERROR_FAIL; >> - goto out; >> - } >> } >> >> const unsigned long argnums[] = { >> diff --git a/tools/libxl/libxl_save_helper.c >> b/tools/libxl/libxl_save_helper.c >> index 3bdfa28..772251a 100644 >> --- a/tools/libxl/libxl_save_helper.c >> +++ b/tools/libxl/libxl_save_helper.c >> @@ -171,12 +171,14 @@ static int toolstack_save_cb(uint32_t domid, >> uint8_t **buf, >> { >> assert(toolstack_save_fd > 0); >> >> + int r = lseek(toolstack_save_fd, 0, SEEK_SET); >> + if (r) fail(errno,"rewind toolstack data tmpfile"); >> + >> *buf = xmalloc(toolstack_save_len); >> - int r = read_exactly(toolstack_save_fd, *buf, toolstack_save_len); >> + r = read_exactly(toolstack_save_fd, *buf, toolstack_save_len); >> if (r<0) fail(errno,"read toolstack data"); >> if (r==0) fail(0,"read toolstack data eof"); >> >> - toolstack_save_fd = -1; >> *len = toolstack_save_len; >> return 0; >> } >> >> > --000e0ce0d65eb6a3a704c376e9b8 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On Wed, Jun 27, 2012 at 12:09 PM, Shriram Rajagopalan <= ;rshriram@cs.ubc.ca= > wrote:
On Wed, Jun 27, 2012 at 11:59 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
Ian Jackson writes ("Re: [PATCH v5 00/21] libxl: domain save/restore: = run in a separate process"):
> However, when I apply my series, I can indeed produce an assertio= n
> failure:
...
> So I have indeed made matters worse.

I found two bugs:

1. The void* passed to the callback was being treated as a
libxl__domain_suspend_state* by the remus callbacks; this is a
holdover from a much earlier version of the series. =A0It should be
converted to a libxl__save_helper_state and then the dss extracted
with CONTAINER_OF.

2. The way remus works means that the toolstack save callback is
invoked more than once, which the helper's implementation was not
prepared to deal with. =A0Fix this by moving the rewind of the fd into
the helper.

Fixes for these are below. =A0With this, on top of my series, seem to I
get the same behaviour as with the baseline. =A0Would you like to try it ?<= br>

Sure, I ll give it a shot.
=
Btw, my earlier mail was in response to remus not
working on= the baseline setup on your dev environment.


The fix w= orks for 2 out of 3 cases
=A0blackhole replication (xl remus -b)<= /div>
=A0localhost replication with failover i.e. destroy primary (xl r= emus domU localhost)

However, it crashes the guest for localhost replication= , when I destroy the backup
i.e. xl destroy domU--incoming . The = primary guest would generally resume, but in this
case its in --s= c- state.
NB: This seems to happen in baseline xen-unstable too!.=A0
<= br>
xc: error: unexpected PFN mapping failure pfn 180e map_m= fn 43b808 p2m_mfn 43b808: Internal error
libxl: error: libxl_crea= te.c:760:libxl__xc_domain_restore_done: restoring domain: Resource temporar= ily unavailable
libxl: error: libxl_create.c:844:domcreate_rebuild_done: cannot (re-)b= uild domain: -3
libxl: error: libxl.c:1220:libxl_domain_destroy: = non-existant domain 17
libxl: error: libxl_create.c:995:domcreate= _complete: unable to destroy domain 17 following failed creation
migration target: Domain creation failed (code -3).
..
=
Total Data Sent=3D 12.597 MB
libxl: debug: libxl_dom.c:801:l= ibxl__domain_suspend_common_callback: issuing PV suspend request via XenBus= control node
libxl: debug: libxl_dom.c:805:libxl__domain_suspend_common_callback: w= ait for the guest to acknowledge suspend request
libxl: debug: li= bxl_dom.c:852:libxl__domain_suspend_common_callback: guest acknowledged sus= pend request
libxl: debug: libxl_dom.c:856:libxl__domain_suspend_common_callback: w= ait for the guest to suspend
libxl: debug: libxl_dom.c:870:libxl_= _domain_suspend_common_callback: guest has suspended
pagetables= =3D2,cache_misses=3D0,emptypages=3D45
libxl: error: libxl_utils.c:363:libxl_read_exactly: file/stream trunca= ted reading ipc msg header from domain 16 save/restore helper stdout pipe
libxl: error: libxl_exec.c:129:libxl_report_child_exitstatus: doma= in 16 save/restore helper [3148] died due to fatal signal Broken pipe
libxl: debug: libxl_event.c:1434:libxl__ao_complete: ao 0x1b08c80: com= plete, rc=3D-3
libxl: debug: libxl_event.c:1406:libxl__ao__destro= y: ao 0x1b08c80: destroy
remus sender: libxl_domain_suspend faile= d (rc=3D-3)
Remus: Backup failed? resuming domain at primary.
xc: debug:= hypercall buffer: total allocations:2116 total releases:2116
xc:= debug: hypercall buffer: current allocations:0 maximum allocations:2
xc: debug: hypercall buffer: cache current size:2
xc: debug:= hypercall buffer: cache hits:1729 misses:2 toobig:385

=

=A0
=A0
Thanks,
Ian.

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index abc5932..069aca1 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -984,7 +984,8 @@ static int libxl__remus_domain_suspend_callback(void *d= ata)

=A0static int libxl__remus_domain_resume_callback(void *data)
=A0{
- =A0 =A0libxl__domain_suspend_state *dss =3D data;
+ =A0 =A0libxl__save_helper_state *shs =3D data;
+ =A0 =A0libxl__domain_suspend_state *dss =3D CONTAINER_OF(shs, *dss, = shs);
=A0 =A0 STATE_AO_GC(dss->ao);

=A0 =A0 /* Resumes the domain and the device model */
@@ -1002,7 +1003,8 @@ static void remus_checkpoint_dm_saved(libxl__egc *egc= ,

=A0static void libxl__remus_domain_checkpoint_callback(void *data)
=A0{
- =A0 =A0libxl__domain_suspend_state *dss =3D data;
+ =A0 =A0libxl__save_helper_state *shs =3D data;
+ =A0 =A0libxl__domain_suspend_state *dss =3D CONTAINER_OF(shs, *dss, shs);=
=A0 =A0 libxl__egc *egc =3D dss->shs.egc;
=A0 =A0 STATE_AO_GC(dss->ao);

diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_call= out.c
index 6332beb..078b7ee 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -105,13 +105,6 @@ void libxl__xc_domain_save(libxl__egc *egc, libxl__dom= ain_suspend_state *dss,
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 toolstack_= data_buf, toolstack_data_len,
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 "= ;toolstack data tmpfile", 0);
=A0 =A0 =A0 =A0 if (r) { rc =3D ERROR_FAIL; goto out; }
-
- =A0 =A0 =A0 =A0r =3D lseek(toolstack_data_fd, 0, SEEK_SET);
- =A0 =A0 =A0 =A0if (r) {
- =A0 =A0 =A0 =A0 =A0 =A0LOGE(ERROR, "rewind toolstack data tmpfile&qu= ot;);
- =A0 =A0 =A0 =A0 =A0 =A0rc =3D ERROR_FAIL;
- =A0 =A0 =A0 =A0 =A0 =A0goto out;
- =A0 =A0 =A0 =A0}
=A0 =A0 }

=A0 =A0 const unsigned long argnums[] =3D {
diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save= _helper.c
index 3bdfa28..772251a 100644
--- a/tools/libxl/libxl_save_helper.c
+++ b/tools/libxl/libxl_save_helper.c
@@ -171,12 +171,14 @@ static int toolstack_save_cb(uint32_t domid, uint8_t = **buf,
=A0{
=A0 =A0 assert(toolstack_save_fd > 0);

+ =A0 =A0int r =3D lseek(toolstack_save_fd, 0, SEEK_SET);
+ =A0 =A0if (r) fail(errno,"rewind toolstack data tmpfile");
+
=A0 =A0 *buf =3D xmalloc(toolstack_save_len);
- =A0 =A0int r =3D read_exactly(toolstack_save_fd, *buf, toolstack_save_len= );
+ =A0 =A0r =3D read_exactly(toolstack_save_fd, *buf, toolstack_save_len); =A0 =A0 if (r<0) fail(errno,"read toolstack data");
=A0 =A0 if (r=3D=3D0) fail(0,"read toolstack data eof"= ;);

- =A0 =A0toolstack_save_fd =3D -1;
=A0 =A0 *len =3D toolstack_save_len;
=A0 =A0 return 0;
=A0}



--000e0ce0d65eb6a3a704c376e9b8-- --===============8834480936510365647== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============8834480936510365647==--