From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andreas Olsowski <andreas.olsowski@uni.leuphana.de>
Subject: Re: slow live magration / xc_restore on xen4 pvops
Date: Wed, 2 Jun 2010 17:46:45 +0200
Message-ID: <20100602174645.9b37b6b1.andreas.olsowski@uni.leuphana.de>
References: <4C0578EB.2040800@uni.leuphana.de>
	<C82BC2B3.166A7%keir.fraser@eu.citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <C82BC2B3.166A7%keir.fraser@eu.citrix.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Hi Keir,

I changed all DPRINTF calls to ERROR, and the commented-out // DPRINTF calls to ERROR as well.
There are no DBGPRINTF calls in my xc_domain_restore.c though.

This is the new xend.log output. Of course, in this case the "ERROR Internal error:" prefix is actually debug output.
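
For illustration, the same effect can be had without editing every call site by pointing the debug macro at the error macro. The following is only a sketch under assumptions: the ERROR defined here is a stand-in that formats into a buffer (the real ERROR macro in libxc is defined elsewhere and works differently), and last_message is a made-up name. Only the "ERROR Internal error:" prefix is taken from the log output.

```c
#include <stdio.h>

/* Hypothetical stand-in for the real error logger: it only formats the
 * message into a buffer so the result can be inspected. */
static char last_message[256];

/* Stand-in ERROR macro (assumption, not the libxc definition).
 * The first argument must be a string literal format. */
#define ERROR(...) \
    snprintf(last_message, sizeof(last_message), \
             "ERROR Internal error: " __VA_ARGS__)

/* Route debug output through ERROR so that it is logged at error priority. */
#define DPRINTF(...) ERROR(__VA_ARGS__)
```

With these definitions, a call such as DPRINTF("reading batch of %d pages", 1024) ends up formatted with the error prefix, which matches the shape of the log lines in this mail.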

xenturio1:~# tail -f /var/log/xen/xend.log
[2010-06-02 15:44:19 5468] DEBUG (XendCheckpoint:286) restore:shadow=0x0, _static_max=0x20000000, _static_min=0x0,
[2010-06-02 15:44:19 5468] DEBUG (XendCheckpoint:305) [xc_restore]: /usr/lib/xen/bin/xc_restore 50 51 1 2 0 0 0 0
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: xc_domain_restore start: p2m_size = 20000
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: Reloading memory pages:   0%
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of -7 pages
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:49:03 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
...
[2010-06-02 15:49:09 5468] INFO (XendCheckpoint:423) ERROR Internal err100%
...

One can see the time gap between the first and the following memory batch reads (15:44:19 to 15:49:02).
After that, restoration works as expected.
You might notice that there is a "0%" and then a "100%" with no steps in between, whereas xc_save does show intermediate steps. Is that intentional, or maybe another symptom of the same problem?

As for the read_exact stuff:
tarballerina:/usr/src/xen-4.0.0# find . -type f -iname \*.c -exec grep -H RDEXACT {} \;
tarballerina:/usr/src/xen-4.0.0# find . -type f -iname \*.c -exec grep -H rdexact {} \;

There are no RDEXACT/rdexact matches in my xen source code.
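
For context, a plain "read exactly N bytes" loop of the kind read_exact() is expected to provide might look roughly like the following. This is a sketch and not the libxc code; the name read_exact_sketch and its return convention are assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper, not the libxc function: read exactly `size` bytes
 * from `fd` into `buf`, retrying on short reads and on EINTR.
 * Returns 0 on success, -1 on error or premature end of file. */
static int read_exact_sketch(int fd, void *buf, size_t size)
{
    size_t offset = 0;

    while (offset < size) {
        ssize_t len = read(fd, (char *)buf + offset, size - offset);

        if (len == -1 && errno == EINTR)
            continue;               /* interrupted by a signal, retry */
        if (len == 0)
            errno = 0;              /* EOF before all bytes arrived */
        if (len <= 0)
            return -1;
        offset += (size_t)len;
    }
    return 0;
}
```

A loop like this blocks in read() until data arrives, so it has no timeout behaviour of its own that could account for a delay.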

In a few hours I will shut down all virtual machines on one of the hosts experiencing slow xc_restores, maybe reboot it, and check whether xc_restore is any faster without load on the machine.

I'll check in with results later.


On Wed, 2 Jun 2010 08:11:31 +0100
Keir Fraser <keir.fraser@eu.citrix.com> wrote:

> Hi Andreas,
>
> This is an interesting bug, to be sure. I think you need to modify the
> restore code to get a better idea of what's going on. The file in the Xen
> tree is tools/libxc/xc_domain_restore.c. You will see it contains many
> DBGPRINTF and DPRINTF calls, some of which are commented out, and some of
> which may 'log' at too low a priority level to make it to the log file. For
> your purposes you might change them to ERROR calls as they will definitely
> get properly logged. One area of possible concern is that our read function
> (RDEXACT, which is a macro mapping to rdexact) was modified for Remus to
> have a select() call with a timeout of 1000ms. Do I entirely trust it? Not
> when we have the inexplicable behaviour that you're seeing. So you might try
> mapping RDEXACT() to read_exact() instead (which is what we already do when
> building for __MINIOS__).
>
> This all assumes you know your way around C code at least a little bit.
>=20
>  -- Keir
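
For reference, a read loop that waits in select() with a timeout before each read, as described in the quoted text above, might look roughly like this. It is a sketch under assumptions and not the rdexact implementation from the Xen tree: the function name, the timeout_ms parameter, and the choice to fail with ETIMEDOUT are all made up for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: read exactly `size` bytes from `fd`, waiting at most
 * `timeout_ms` milliseconds for data before each read() call.
 * Returns 0 on success, -1 on error, end of file, or timeout. */
static int read_exact_timeout_sketch(int fd, void *buf, size_t size,
                                     int timeout_ms)
{
    size_t offset = 0;

    while (offset < size) {
        fd_set rfds;
        struct timeval tv;
        int ready;
        ssize_t len;

        /* select() may modify both arguments, so rebuild them every time. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready == -1 && errno == EINTR)
            continue;               /* interrupted by a signal, retry */
        if (ready == 0) {
            errno = ETIMEDOUT;      /* assumption: treat a timeout as failure */
            return -1;
        }
        if (ready < 0)
            return -1;

        len = read(fd, (char *)buf + offset, size - offset);
        if (len == -1 && errno == EINTR)
            continue;
        if (len <= 0)
            return -1;              /* error or premature end of file */
        offset += (size_t)len;
    }
    return 0;
}
```

How a timeout is handled is the interesting design choice: failing as above, or simply looping and waiting again, changes how long a stalled sender can hold up the reader.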


-- 
Andreas Olsowski <andreas.olsowski@uni.leuphana.de>
Leuphana Universität Lüneburg
System- und Netzwerktechnik
Rechenzentrum, Geb 7, Raum 15
Scharnhorststr. 1
21335 Lüneburg

Tel: ++49 4131 / 6771309