From: Andreas Olsowski
Subject: slow live migration / xc_restore on xen4 pvops
Date: Tue, 01 Jun 2010 23:17:31 +0200
Message-ID: <4C0578EB.2040800@uni.leuphana.de>
In-Reply-To: <2FD61F37AFF16D4DB46149330E4273C702FF9687@dcl-ex.dcml.docomolabs-usa.com>
To: xen-devel@lists.xensource.com

Hi,

in preparation for our soon-to-arrive central storage array I wanted to test live migration and Remus replication and stumbled upon a problem.

When migrating a test VM (512 MB RAM, idle) between my 3 servers, two of them are extremely slow in "receiving" the VM. There is little to no CPU utilization from xc_restore until shortly before the migration is complete. The same goes for xm restore.

The xend.log contains:

[2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:286) restore:shadow=0x0, _static_max=0x20000000, _static_min=0x0,
[2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:305) [xc_restore]: /usr/lib/xen/bin/xc_restore 48 43 1 2 0 0 0 0
[2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) xc_domain_restore start: p2m_size = 20000
[2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) Reloading memory pages: 0%
[2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal error: Error when reading batch size
[2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal error: error when buffering batch, finishing

That is the point at which receiving a VM via live migration finally finishes; you can see the large gap in the timestamps. The VM is perfectly fine after that, it just takes way too long.

First off, let me explain my server setup; detailed information on trying to narrow down the error follows.

I have 3 servers running Xen 4 with 2.6.31.13-pvops as kernel; it's the current kernel from Jeremy's xen/master git branch. The guests are running vanilla 2.6.32.11 kernels.

The 3 servers differ slightly in hardware: two are Dell PE 2950 and one is a Dell R710. The 2950s have 2 quad-core Xeons each (L5335 and L5410), the R710 has 2 quad-core Xeon E5520s. All machines have 24 GB of RAM. They are called "tarballerina" (E5520), "xenturio1" (L5335) and "xenturio2" (L5410).

Currently I use tarballerina for testing purposes, but I don't consider anything in my setup "stable". xenturio1 has 27 guests running, xenturio2 has 25. No guest does anything that would even put a dent into the systems' performance (LDAP servers, RADIUS, department webservers, etc.).

I created a test VM on my current central iSCSI storage, called "hatest", that just idles around and has 2 VCPUs and 512 MB of RAM.
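(For anyone who wants to reproduce the numbers below: they were all taken with plain time(1) on the respective dom0. A loop roughly like the following should do, with the domain name and dump path adjusted to your own setup; the ones here are just from mine:)

  # repeat a few timed save/restore cycles of an idle test VM
  # DOMAIN and DUMP are placeholders for whatever VM/file you use
  DOMAIN=saverestore-t
  DUMP=/var/saverestore-t.mem
  for i in 1 2 3; do
      time xm save "$DOMAIN" "$DUMP"
      time xm restore "$DUMP"
  done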
First I tested xm save/restore:

tarballerina:~# time xm restore /var/saverestore-t.mem
real    0m13.227s
user    0m0.090s
sys     0m0.023s

xenturio1:~# time xm restore /var/saverestore-x1.mem
real    4m15.173s
user    0m0.138s
sys     0m0.029s

When migrating to xenturio1 or 2, the migration takes 181 to 278 seconds; when migrating to tarballerina it takes roughly 30 seconds:

tarballerina:~# time xm migrate --live hatest 10.0.1.98
real    3m57.971s
user    0m0.086s
sys     0m0.029s

xenturio1:~# time xm migrate --live hatest 10.0.1.100
real    0m43.588s
user    0m0.123s
sys     0m0.034s

--- attempt at narrowing it down ---

My first guess was that, since tarballerina had almost no guests running that did anything, it could be an issue of memory usage by the tapdisk2 processes (each dom0 has been mem-set to 4096M). I then started almost all the VMs that I have on tarballerina:

tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
real    0m2.884s

tarballerina:~# time xm restore /var/saverestore-t.mem
real    0m15.594s

I tried this several times; sometimes it took 30+ seconds.

Then I started 2 VMs that run load- and IO-generating processes (stress, dd, openssl encryption, md5sum). But this didn't affect xm restore performance either, it still was quite fast:

tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
real    0m7.476s
user    0m0.101s
sys     0m0.022s

tarballerina:~# time xm restore /var/saverestore-t.mem
real    0m45.544s
user    0m0.094s
sys     0m0.022s

I tried several times again; restore took 17 to 45 seconds.

Then I tried migrating the test VM to tarballerina again: still fast, in spite of the several running VMs, including the load- and IO-generating ones (this had eaten up almost all available RAM).

CPU times for xc_restore according to the target machine's "top":

tarballerina -> xenturio1: 0:05:xx, CPU 2-4%, near the end 40%
xenturio1 -> tarballerina: 0:04:xx, CPU 4-8%, near the end 54%

tarballerina:~# time xm migrate --live hatest 10.0.1.98
real    3m29.779s
user    0m0.102s
sys     0m0.017s

xenturio1:~# time xm migrate --live hatest 10.0.1.100
real    0m28.386s
user    0m0.154s
sys     0m0.032s

So my attempt at narrowing the problem down failed: it's neither the free memory of the dom0 nor the load, IO or memory that the other domUs utilize.

--- end attempt ---

More info (xm list, meminfo, table with migration times, etc.) on my setup can be found here:
http://andiolsi.rz.uni-lueneburg.de/node/37

There was another guy who has the same error in his logfile; this might or might not be related:
http://lists.xensource.com/archives/html/xen-users/2010-05/msg00318.html

Further information can be given, should demand for it arise.

With best regards

---
Andreas Olsowski
Leuphana Universität Lüneburg
System- und Netzwerktechnik
Rechenzentrum, Geb 7, Raum 15
Scharnhorststr. 1
21335 Lüneburg

Tel: ++49 4131 / 6771309
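PS: In case more data is needed, I could for example sample xc_restore on the receiving host during the slow phase with something roughly like the following (and/or attach strace -p to the xc_restore pid) and post the output:

  # run on the receiving dom0 while the incoming migration is stalled;
  # samples CPU usage and wait channel of xc_restore every 5 seconds
  while pgrep -x xc_restore >/dev/null; do
      ps -C xc_restore -o pid,pcpu,time,wchan,args
      sleep 5
  done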