From mboxrd@z Thu Jan  1 00:00:00 1970
From: Brendan Cully <brendan@cs.ubc.ca>
Subject: Re: Remus : VM on backup not in pause state
Date: Wed, 18 Aug 2010 13:10:49 -0700
Message-ID: <20100818201049.GC2411@kremvax.cs.ubc.ca>
References: <AANLkTimMdbYpbLEMJooOTwdq-XpjLBmEVNBlnneuPoXz@mail.gmail.com>
	<20100722214913.GE3994@kremvax.cs.ubc.ca>
	<AANLkTimV1PGPes9CZfyyg6rbpH_BotkoL7VQwZhWWosf@mail.gmail.com>
	<20100726220526.GA19006@kremvax.cs.ubc.ca>
	<AANLkTineFTYp-R5S4_Y0-NNPSvu8ftot_6kRScuaFc+B@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
Content-Disposition: inline
In-Reply-To: <AANLkTineFTYp-R5S4_Y0-NNPSvu8ftot_6kRScuaFc+B@mail.gmail.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Dulloor <dulloor@gmail.com>
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

I've reproduced this on unstable (I'd been running the 4.0 tree
previously). The cause was a well-intentioned but broken patch by Ian
Jackson (21488:dd6bbdc42033). I've just mailed the fix to xen-devel.

On Monday, 26 July 2010 at 23:17, Dulloor wrote:
> Thanks for the pointers. I haven't had time to work on this. I will
> collect more data and get back as soon as I can.
>=20
> -dulloor
>=20
> On Mon, Jul 26, 2010 at 3:05 PM, Brendan Cully <brendan@cs.ubc.ca> wrot=
e:
> > On Thursday, 22 July 2010 at 16:40, Dulloor wrote:
> >> On Thu, Jul 22, 2010 at 2:49 PM, Brendan Cully <brendan@cs.ubc.ca> w=
rote:
> >> > On Thursday, 22 July 2010 at 13:45, Dulloor wrote:
> >> >> My setup is as follows :
> >> >> - xen : unstable (rev:21743)
> >> >> - Dom0 : pvops (branch : stable-2.6.32.x,
> >> >> rev:01d9fbca207ec232c758d991d66466fc6e38349e)
> >> >> - Guest Configuration :
> >> >> -----------------------------------------------------------------=
-------------------------
> >> >> kernel =3D "/usr/lib/xen/boot/hvmloader"
> >> >> builder=3D'hvm'
> >> >> name =3D "linux-hvm"
> >> >> vcpus =3D 4
> >> >> memory =3D 2048
> >> >> vif =3D [ 'type=3Dioemu, bridge=3Deth0, mac=3D00:1c:3e:17:22:13' =
]
> >> >> disk =3D [ 'phy:/dev/XenVolG/hvm-linux-snap-1.img,hda,w' ]
> >> >> device_model =3D '/usr/lib/xen/bin/qemu-dm'
> >> >> boot=3D"cd"
> >> >> sdl=3D0
> >> >> vnc=3D1
> >> >> vnclisten=3D"0.0.0.0"
> >> >> vncconsole=3D0
> >> >> vncpasswd=3D''
> >> >> stdvga=3D0
> >> >> superpages=3D1
> >> >> serial=3D'pty'
> >> >> -----------------------------------------------------------------=
-------------------------
> >> >>
> >> >> - Remus command :
> >> >> # remus --no-net linux-hvm <dst-ip>
> >> >>
> >> >> - On primary :
> >> >> # xm list
> >> >> Name =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0ID =A0 Mem VCPUs =A0 =A0 =A0State =A0 Time(s)
> >> >> linux-hvm =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A09 =A02048 =A0 =A0 4 =A0 =A0 -b-s-- =A0 =A0 10.8
> >> >>
> >> >> - On secondary :
> >> >> # xm list
> >> >> Name =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0ID =A0 Mem VCPUs =A0 =A0 =A0State =A0 Time(s)
> >> >> linux-hvm =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 11 =A02048 =A0 =A0 4 =A0 =A0 -b---- =A0 =A0 =A01.9
> >> >>
> >> >>
> >> >> I have to issue "xm pause/unpause" explicitly for the backup VM.
> >> >> Any recent changes ?
> >> >
> >> > This probably means there was a timeout on the replication channel=
,
> >> > interpreted by the backup as a failure of the primary, which cause=
d it
> >> > to activate itself. You should see evidence of that in the remus
> >> > console logs and xend.log and daemon.log (for the disk side).
> >> >
> >> > Once you've figured out where the timeout happened it'll be easier=
 to
> >> > figure out why.
> >> >
> >> Please find the logs attached. I didn't find anything interesting in
> >> daemon.log.
> >> What does remus log there ? I am not using disk replication, since I
> >> have issues with that .. but that's for another email :)
> >
> > daemon.log is just for disk replication, so if you're not using it yo=
u
> > won't see anything.
> >
> >> The only visible error is in xend-secondary.log around xc_restore :
> >> [2010-07-22 16:15:37 2056] DEBUG (balloon:207) Balloon: setting dom0=
 target to 5
> >> 765 MiB.
> >> [2010-07-22 16:15:37 2056] DEBUG (XendDomainInfo:1467) Setting memor=
y target of
> >> domain Domain-0 (0) to 5765 MiB.
> >> [2010-07-22 16:15:37 2056] DEBUG (XendCheckpoint:290) [xc_restore]: =
/usr/lib/xen
> >> /bin/xc_restore 5 1 5 6 1 1 1 0
> >> [2010-07-22 16:18:42 2056] INFO (XendCheckpoint:408) xc: error: Erro=
r
> >> when reading pages (11 =3D Resource temporarily unavailabl): Interna=
l
> >> error
> >> [2010-07-22 16:18:42 2056] INFO (XendCheckpoint:408) xc: error: erro=
r
> >> when buffering batch, finishing (11 =3D Resource temporarily
> >> unavailabl): Internal error
> >>
> >> If you haven't seen this before, please let me know and I will try
> >> debugging more.
> >
> > I haven't seen that. It looks like read_exact_timed has failed with
> > EAGAIN, which is surprising since it explicitly looks for EAGAIN and
> > loops on it. Can you log len and errno after line 77 in
> > read_exact_timed in tools/libxc/xc_domain_restore.c? ie change
> >
> > =A0 =A0 =A0 if ( len <=3D 0 )
> > =A0 =A0 =A0 =A0 =A0 =A0return -1;
> >
> > to something like
> >
> > =A0 if ( len <=3D 0 ) {
> > =A0 =A0 =A0 fprintf(stderr, "read_exact_timed failed (read rc: %d, er=
rno: %d)\n",
> > =A0 =A0 =A0 len, errno);
> > =A0 =A0 =A0 return -1;
> > =A0 }
> >
> > Another possibility is read is returning 0 here (and EAGAIN is just a
> > leftover errno from a previous read), which would indicate that the
> > _sender_ hung up the connection. It's hard to tell exactly what's
> > going on because you seem to have an enormous amount of clock skew
> > between your primary and secondary dom0s and I can't tell whether the
> > logs match up.
> >
>=20
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>=20