From: Shriram Rajagopalan <rshriram@cs.ubc.ca>
To: Ian Jackson
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [PATCH v5 00/21] libxl: domain save/restore: run in a separate process
Date: Wed, 27 Jun 2012 12:06:15 -0400

On Wed, Jun 27, 2012 at 9:46 AM, Ian Jackson wrote:
> Shriram Rajagopalan writes ("Re: [PATCH v5 00/21] libxl: domain
> save/restore: run in a separate process"):
> > Ian,
> > The code segfaults. Here are the system details and error traces from gdb.
>
> Thanks.
>
> > My setup:
> >
> > dom0: ubuntu 64-bit, 2.6.32-39 (pvops kernel),
> >       running the latest xen-4.2-unstable (built from your repo);
> >       tools stack also built from your repo (which I hope has all the latest patches).
> >
> > domU: ubuntu 32-bit PV, xenolinux kernel (2.6.32.2 - Novell SUSE version)
> >       with suspend event channel support
> >
> > As a sanity check, I tested xl remus with the latest tip from the xen-unstable
> > mercurial repo, c/s 25496:e08cf97e76f0.
> >
> > Blackhole replication (to /dev/null) and localhost replication worked as expected,
> > and the guest recovered properly without any issues.
>
> Thanks for the test runes.  That didn't work entirely properly for
> me, even with the xen-unstable baseline.
>
> I did this
>   xl -vvvv remus -b -i 100 debian.guest.osstest dummy >remus.log 2>&1 &
> The result was that the guest's networking broke.  The guest shows up
> in xl list as
>   debian.guest.osstest        7   512     1     ---ss-       5.2
> and is still responsive on its pv console.

This is normal. You are suspending every 100ms, so when you see ---ss- you
just happened to run "xl list" right when the guest was suspended. :)

Do an "xl top" and you will see the guest's state oscillate between --b-- and
--s--, depending on the checkpoint interval. Or run "xl list" several times in
a row.
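For example, a quick throwaway loop like this (using your test guest's name;
any domU under remus works) samples the state column often enough to catch it
flipping as checkpoints fire:

  # sample the state column ~20 times over one second (GNU sleep accepts fractions)
  for i in $(seq 1 20); do xl list debian.guest.osstest | tail -n 1; sleep 0.05; done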
> After I killed the remus
> process, the guest's networking was still broken.

That is strange. xl remus has literally no networking support on the remus
front, so it shouldn't affect anything in the guest. In fact, I repeated your
test on my box, where the guest was continuously pinging a host. Pings
continued to work, and so did ssh.

> At the start, the guest prints this on its console:
>  [   36.017241] WARNING: g.e. still in use!
>  [   36.021056] WARNING: g.e. still in use!
>  [   36.024740] WARNING: g.e. still in use!
>  [   36.024763] WARNING: g.e. still in use!
>
> If I try the rune with "localhost" I would have expected, surely, to
> see a domain with the incoming migration ?  But I don't.  I tried
> killing the `xl remus' process and the guest became wedged.

With the "-b" option the second argument (localhost|dummy) is ignored. Did you
try the command without the -b option, i.e.

  xl remus -vvv -e domU localhost

But I was partially able to reproduce some of your test results without your
patches (i.e. on the xen-unstable baseline). See the end of this mail for more
details.

> However, when I apply my series, I can indeed produce an assertion
> failure:
>
>  xc: detail: All memory is saved
>  xc: error: Could not get domain info (3 = No such process): Internal error
>  libxl: error: libxl.c:388:libxl_domain_resume: xc_domain_resume failed for domain 3077579968: No such process
>  xl: libxl_event.c:1426: libxl__ao_inprogress_gc: Assertion `ao->magic == 0xA0FACE00ul' failed.
>
> So I have indeed made matters worse.
>
> > Blackhole replication:
> > ================
> > xl error:
> > ----------
> > xc: error: Could not get domain info (3 = No such process): Internal error
> > libxl: error: libxl.c:388:libxl_domain_resume: xc_domain_resume failed for domain 4154075147: No such process
> > libxl: error: libxl_dom.c:1184:libxl__domain_save_device_model: unable to open qemu save file ?8b: No such file or directory
>
> I don't see that at all.
>
> NB that PV guests may have a qemu for certain disk backends, or
> consoles, depending on the configuration.  Can you show me your domain
> config ?  Mine is below.

Ah, that explains the qemu-related calls.

My guest config (from the tests on the 32-bit PV domU w/ suspend event channel
support):

kernel = "/home/kernels/vmlinuz-2.6.32.2-xenu"
memory = 1024
name = "xltest2"
vcpus = 2
vif = [ 'mac=00:16:3e:00:00:01,bridge=eth0' ]
disk = [ 'phy:/dev/drbd1,xvda1,w' ]
hostname = "rshriram-vm3"
root = "/dev/xvda1 ro"
extra = "console=xvc0 3"
on_poweroff = 'destroy'
on_reboot   = 'destroy'
on_crash    = 'coredump-destroy'

NB: This guest kernel has suspend-event-channel support, which I suppose is
available in all SUSE kernels. If you would just like to use mine, the source
tarball (2.6.32.2 version + kernel config) is at
http://aramis.nss.cs.ubc.ca/xenolinux-2.6.32.2.tar.gz

> > I also ran xl in GDB to get a stack trace and hopefully some useful debug info.
> > gdb traces: http://pastebin.com/7zFwFjW4
>
> I get a different crash - see above.
>
> > Localhost replication: Partial success, but xl still segfaults
> > dmesg shows
> > [ 1399.254849] xl[4716]: segfault at 0 ip 00007f979483a417 sp 00007fffe06043e0 error 6 in libxenlight.so.2.0.0[7f9794807000+4d000]
>
> I see exactly the same thing with `localhost' instead of `dummy'.  And
> I see no incoming domain.
>
> I will investigate the crash I see.  In the meantime can you try to
> help me see why it doesn't work for me even with the baseline ?

I also tested with a 64-bit 3.3.0 PV kernel (w/o suspend event channel support).

guest config:

kernel = "/home/kernels/vmlinuz-3.3.0-rc1-xenu"
memory = 1024
name = "xl-ubuntu-pv64"
vcpus = 2
vif = [ 'mac=00:16:3e:00:00:03, bridge=eth0' ]
disk = [ 'phy:/dev/vgdrbd/ubuntu-pv64,xvda1,w' ]
hostname = "rshriram-vm1"
root = "/dev/xvda1 ro"
extra = "console=hvc0 3"

With the xen-unstable baseline:
Test 1. Blackhole replication
  command: nohup xl remus -vvv -e -b -i 100 xl-ubuntu-pv64 dummy >blackhole.log 2>&1 &
  result:  works (networking included)

  debug output:
  libxl: debug: libxl_dom.c:687:libxl__domain_suspend_common_callback: issuing PV suspend request via XenBus control node
  libxl: debug: libxl_dom.c:691:libxl__domain_suspend_common_callback: wait for the guest to acknowledge suspend request
  libxl: debug: libxl_dom.c:738:libxl__domain_suspend_common_callback: guest acknowledged suspend request
  libxl: debug: libxl_dom.c:742:libxl__domain_suspend_common_callback: wait for the guest to suspend
  libxl: debug: libxl_dom.c:754:libxl__domain_suspend_common_callback: guest has suspended

  caveat: killing remus doesn't do a proper cleanup, i.e. if you kill it while
          the domain is suspended, it leaves the domain in the suspended state
          (where libxl waits for the guest to suspend). It's a pain. In the
          xend/python version, I added a SIGUSR1 handler, so that one could do
          "pkill -USR1 -f remus" and exit remus gracefully, without wedging the
          domU.

          * I do not know if adding signal handlers is frowned upon in xl land :)
            If there is some protocol in place to handle such things, I would be
            happy to send a patch that ensures that the guest is "resumed" while
            doing blackhole replication. (A rough sketch of the idea is at the
            end of this mail.)

Test 2. Localhost replication w/ failover by destroying the primary VM
  command: nohup xl remus -vvv -b -i 100 xl-ubuntu-pv64 localhost >blackhole.log 2>&1 &
  result:  works (networking included)
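Just to illustrate the behaviour I have in mind, here is a sketch; this is not
the real remus/xl code, and take_checkpoint is only a stand-in for the actual
suspend/replicate/resume step:

#!/bin/sh
# Sketch only: SIGUSR1 asks the checkpoint loop to stop at a safe point,
# i.e. after the guest has been resumed from the current checkpoint,
# so stopping remus never leaves the domU suspended.
stop=0
trap 'stop=1' USR1

take_checkpoint() {
    # placeholder for: suspend guest, send/discard dirty state, resume guest
    sleep 0.1
}

while [ "$stop" -eq 0 ]; do
    take_checkpoint
done

echo "remus: clean exit, guest left running"

The real patch would obviously hook this into xl/libxl rather than a shell
loop; the sketch is only about where the exit happens.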