From mboxrd@z Thu Jan 1 00:00:00 1970 From: Brendan Cully Subject: Re: Remus : VM on backup not in pause state Date: Wed, 18 Aug 2010 13:10:49 -0700 Message-ID: <20100818201049.GC2411@kremvax.cs.ubc.ca> References: <20100722214913.GE3994@kremvax.cs.ubc.ca> <20100726220526.GA19006@kremvax.cs.ubc.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable Return-path: Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Dulloor Cc: xen-devel@lists.xensource.com List-Id: xen-devel@lists.xenproject.org I've reproduced this on unstable (I'd been running the 4.0 tree previously). The cause was a well-intentioned but broken patch by Ian Jackson (21488:dd6bbdc42033). I've just mailed the fix to xen-devel. On Monday, 26 July 2010 at 23:17, Dulloor wrote: > Thanks for the pointers. I haven't had time to work on this. I will > collect more data and get back as soon as I can. >=20 > -dulloor >=20 > On Mon, Jul 26, 2010 at 3:05 PM, Brendan Cully wrot= e: > > On Thursday, 22 July 2010 at 16:40, Dulloor wrote: > >> On Thu, Jul 22, 2010 at 2:49 PM, Brendan Cully w= rote: > >> > On Thursday, 22 July 2010 at 13:45, Dulloor wrote: > >> >> My setup is as follows : > >> >> - xen : unstable (rev:21743) > >> >> - Dom0 : pvops (branch : stable-2.6.32.x, > >> >> rev:01d9fbca207ec232c758d991d66466fc6e38349e) > >> >> - Guest Configuration : > >> >> -----------------------------------------------------------------= ------------------------- > >> >> kernel =3D "/usr/lib/xen/boot/hvmloader" > >> >> builder=3D'hvm' > >> >> name =3D "linux-hvm" > >> >> vcpus =3D 4 > >> >> memory =3D 2048 > >> >> vif =3D [ 'type=3Dioemu, bridge=3Deth0, mac=3D00:1c:3e:17:22:13' = ] > >> >> disk =3D [ 'phy:/dev/XenVolG/hvm-linux-snap-1.img,hda,w' ] > >> >> device_model =3D '/usr/lib/xen/bin/qemu-dm' > >> >> boot=3D"cd" > >> >> sdl=3D0 > >> >> vnc=3D1 > >> >> vnclisten=3D"0.0.0.0" > >> >> vncconsole=3D0 > >> >> vncpasswd=3D'' > >> >> stdvga=3D0 > >> >> superpages=3D1 > >> >> serial=3D'pty' > >> >> -----------------------------------------------------------------= ------------------------- > >> >> > >> >> - Remus command : > >> >> # remus --no-net linux-hvm > >> >> > >> >> - On primary : > >> >> # xm list > >> >> Name =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0ID =A0 Mem VCPUs =A0 =A0 =A0State =A0 Time(s) > >> >> linux-hvm =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 =A09 =A02048 =A0 =A0 4 =A0 =A0 -b-s-- =A0 =A0 10.8 > >> >> > >> >> - On secondary : > >> >> # xm list > >> >> Name =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0ID =A0 Mem VCPUs =A0 =A0 =A0State =A0 Time(s) > >> >> linux-hvm =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0= =A0 =A0 =A0 11 =A02048 =A0 =A0 4 =A0 =A0 -b---- =A0 =A0 =A01.9 > >> >> > >> >> > >> >> I have to issue "xm pause/unpause" explicitly for the backup VM. > >> >> Any recent changes ? > >> > > >> > This probably means there was a timeout on the replication channel= , > >> > interpreted by the backup as a failure of the primary, which cause= d it > >> > to activate itself. You should see evidence of that in the remus > >> > console logs and xend.log and daemon.log (for the disk side). > >> > > >> > Once you've figured out where the timeout happened it'll be easier= to > >> > figure out why. > >> > > >> Please find the logs attached. I didn't find anything interesting in > >> daemon.log. > >> What does remus log there ? I am not using disk replication, since I > >> have issues with that .. but that's for another email :) > > > > daemon.log is just for disk replication, so if you're not using it yo= u > > won't see anything. > > > >> The only visible error is in xend-secondary.log around xc_restore : > >> [2010-07-22 16:15:37 2056] DEBUG (balloon:207) Balloon: setting dom0= target to 5 > >> 765 MiB. > >> [2010-07-22 16:15:37 2056] DEBUG (XendDomainInfo:1467) Setting memor= y target of > >> domain Domain-0 (0) to 5765 MiB. > >> [2010-07-22 16:15:37 2056] DEBUG (XendCheckpoint:290) [xc_restore]: = /usr/lib/xen > >> /bin/xc_restore 5 1 5 6 1 1 1 0 > >> [2010-07-22 16:18:42 2056] INFO (XendCheckpoint:408) xc: error: Erro= r > >> when reading pages (11 =3D Resource temporarily unavailabl): Interna= l > >> error > >> [2010-07-22 16:18:42 2056] INFO (XendCheckpoint:408) xc: error: erro= r > >> when buffering batch, finishing (11 =3D Resource temporarily > >> unavailabl): Internal error > >> > >> If you haven't seen this before, please let me know and I will try > >> debugging more. > > > > I haven't seen that. It looks like read_exact_timed has failed with > > EAGAIN, which is surprising since it explicitly looks for EAGAIN and > > loops on it. Can you log len and errno after line 77 in > > read_exact_timed in tools/libxc/xc_domain_restore.c? ie change > > > > =A0 =A0 =A0 if ( len <=3D 0 ) > > =A0 =A0 =A0 =A0 =A0 =A0return -1; > > > > to something like > > > > =A0 if ( len <=3D 0 ) { > > =A0 =A0 =A0 fprintf(stderr, "read_exact_timed failed (read rc: %d, er= rno: %d)\n", > > =A0 =A0 =A0 len, errno); > > =A0 =A0 =A0 return -1; > > =A0 } > > > > Another possibility is read is returning 0 here (and EAGAIN is just a > > leftover errno from a previous read), which would indicate that the > > _sender_ hung up the connection. It's hard to tell exactly what's > > going on because you seem to have an enormous amount of clock skew > > between your primary and secondary dom0s and I can't tell whether the > > logs match up. > > >=20 > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >=20