From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andreas Olsowski
Subject: Re: slow live magration / xc_restore on xen4 pvops
Date: Wed, 2 Jun 2010 17:46:45 +0200
Message-ID: <20100602174645.9b37b6b1.andreas.olsowski@uni.leuphana.de>
References: <4C0578EB.2040800@uni.leuphana.de>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Hi Keir,

I changed all DPRINTF calls to ERROR, and the commented-out // DPRINTF
calls to ERROR as well. There are no DBGPRINTF calls in my
xc_domain_restore.c, though.

This is the new xend.log output; in this case the "ERROR Internal error:"
prefix is of course actually debug output.

xenturio1:~# tail -f /var/log/xen/xend.log
[2010-06-02 15:44:19 5468] DEBUG (XendCheckpoint:286) restore:shadow=0x0, _static_max=0x20000000, _static_min=0x0,
[2010-06-02 15:44:19 5468] DEBUG (XendCheckpoint:305) [xc_restore]: /usr/lib/xen/bin/xc_restore 50 51 1 2 0 0 0 0
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: xc_domain_restore start: p2m_size = 20000
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: Reloading memory pages: 0%
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of -7 pages
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
[2010-06-02 15:44:19 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
[2010-06-02 15:49:02 5468] INFO (XendCheckpoint:423)
[2010-06-02 15:49:03 5468] INFO (XendCheckpoint:423) ERROR Internal error: reading batch of 1024 pages
...
[2010-06-02 15:49:09 5468] INFO (XendCheckpoint:423) ERROR Internal err100%
...

One can see the time gap between the first memory batch read (15:44:19)
and the following ones (15:49:02 onwards), almost five minutes. After
that, restoration works as expected.

You might notice that the progress output shows "0%" and then "100%" with
no steps in between, whereas xc_save does show intermediate steps. Is that
intentional, or maybe another symptom of the same problem?

As for the read_exact stuff:

tarballerina:/usr/src/xen-4.0.0# find . -type f -iname \*.c -exec grep -H RDEXACT {} \;
tarballerina:/usr/src/xen-4.0.0# find . -type f -iname \*.c -exec grep -H rdexact {} \;

There are no RDEXACT/rdexact matches in my Xen source code.

In a few hours I will shut down all virtual machines on one of the hosts
experiencing slow xc_restores, maybe reboot it, and check whether
xc_restore is any faster without load or utilization on the machine.

I'll check in with results later.

On Wed, 2 Jun 2010 08:11:31 +0100
Keir Fraser wrote:

> Hi Andreas,
>
> This is an interesting bug, to be sure. I think you need to modify the
> restore code to get a better idea of what's going on. The file in the Xen
> tree is tools/libxc/xc_domain_restore.c. You will see it contains many
> DBGPRINTF and DPRINTF calls, some of which are commented out, and some of
> which may 'log' at too low a priority level to make it to the log file. For
> your purposes you might change them to ERROR calls as they will definitely
> get properly logged. One area of possible concern is that our read function
> (RDEXACT, which is a macro mapping to rdexact) was modified for Remus to
> have a select() call with a timeout of 1000ms. Do I entirely trust it? Not
> when we have the inexplicable behaviour that you're seeing. So you might try
> mapping RDEXACT() to read_exact() instead (which is what we already do when
> building for __MINIOS__).
>
> This all assumes you know your way around C code at least a little bit.
>
> -- Keir

-- 
Andreas Olsowski
Leuphana Universität Lüneburg
System- und Netzwerktechnik
Rechenzentrum, Geb 7, Raum 15
Scharnhorststr. 1
21335 Lüneburg
Tel: ++49 4131 / 6771309
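
For readers following the thread, the difference Keir describes between a
plain "read exactly N bytes" loop and one that waits in select() with a
timeout before each read can be sketched roughly as below. This is not the
code from tools/libxc: the function names sketch_read_exact and
sketch_read_exact_timeout, the timeout_ms parameter, and the choice to
report a timeout through ETIMEDOUT are all assumptions made for
illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: block until exactly `size` bytes have been read.
 * Returns 0 on success, -1 on error or premature end of file. */
int sketch_read_exact(int fd, void *buf, size_t size)
{
    size_t off = 0;

    assert(buf != NULL);
    while (off < size) {
        ssize_t len = read(fd, (char *)buf + off, size - off);
        if (len == -1 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (len <= 0)
            return -1;              /* read error or EOF before `size` bytes */
        off += (size_t)len;
    }
    return 0;
}

/* Hypothetical helper: same contract, but wait in select() for at most
 * `timeout_ms` before each read. Returns -1 with errno set to ETIMEDOUT
 * if no data arrives in time. `fd` must be below FD_SETSIZE. */
int sketch_read_exact_timeout(int fd, void *buf, size_t size, long timeout_ms)
{
    size_t off = 0;

    assert(buf != NULL);
    while (off < size) {
        fd_set rfds;
        struct timeval tv;
        int ready;
        ssize_t len;

        /* select() may modify both arguments, so rebuild them each pass. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready == -1 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (ready == 0) {
            errno = ETIMEDOUT;      /* nothing readable within the timeout */
            return -1;
        }
        if (ready < 0)
            return -1;              /* select() itself failed */

        len = read(fd, (char *)buf + off, size - off);
        if (len == -1 && errno == EINTR)
            continue;
        if (len <= 0)
            return -1;              /* read error or EOF */
        off += (size_t)len;
    }
    return 0;
}
```

Swapping one variant for the other at the call sites, as Keir suggests
doing with the RDEXACT macro, would be one way to tell whether the
select() path contributes to the stall seen in the log.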