From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:52006)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1ZKQlt-0001N3-UA for qemu-devel@nongnu.org;
 Wed, 29 Jul 2015 08:47:52 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1ZKQlo-0007Jy-QU for qemu-devel@nongnu.org;
 Wed, 29 Jul 2015 08:47:45 -0400
Received: from mail-wi0-f173.google.com ([209.85.212.173]:37470)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1ZKQlo-0007JE-Gq for qemu-devel@nongnu.org;
 Wed, 29 Jul 2015 08:47:40 -0400
Received: by wibud3 with SMTP id ud3so24620636wib.0 for ;
 Wed, 29 Jul 2015 05:47:38 -0700 (PDT)
Date: Wed, 29 Jul 2015 14:47:36 +0200
From: Eduardo Otubo
Message-ID: <20150729124736.GA24261@vader>
References: <20150728132213.GA1603@vader> <20150728151946.GF2247@work-vm>
 <20150729080303.GA7667@vader> <20150729081121.GA2267@work-vm>
 <20150729084104.GB7667@vader> <20150729093259.GD2267@work-vm>
 <20150729100908.GA22821@vader> <20150729103843.GM2267@work-vm>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature"; boundary="5mCyUwZo2JvN/JJP"
Content-Disposition: inline
In-Reply-To: <20150729103843.GM2267@work-vm>
Subject: Re: [Qemu-devel] Live migration hangs after migration to remote host
To: "Dr. David Alan Gilbert"
Cc: Qemu-devel

--5mCyUwZo2JvN/JJP
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Jul 29, 2015 at 11:38:44AM +0100, Dr. David Alan Gilbert wrote:
> * Eduardo Otubo (eduardo.otubo@profitbricks.com) wrote:
> > On Wed, Jul 29, 2015 at 10:32:59AM +0100, Dr. David Alan Gilbert wrote:
> > > * Eduardo Otubo (eduardo.otubo@profitbricks.com) wrote:
> > > > On Wed, Jul 29, 2015 at 09:11:21AM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Eduardo Otubo (eduardo.otubo@profitbricks.com) wrote:
> > > > > > On Tue, Jul 28, 2015 at 04:19:46PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > > * Eduardo Otubo (eduardo.otubo@profitbricks.com) wrote:
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > I'm facing a weird behavior on my tests: I am able to live migrate
> > > > > > > > between two virtual machines on my localhost, but not to another
> > > > > > > > machine, both using tcp.
> > > > > > > >
> > > > > > > > * I am using the same arguments on the command line;
> > > > > > > > * Both virtual machines use the same qcow2 file visible through NFS;
> > > > > > > > * Both machines are in the same subnet;
> > > > > > > > * Migration is being done from intel to intel;
> > > > > > > > * Same version of Qemu (github master - f8787f8723);
> > > > > > > >
> > > > > > > > Using all above I am able to live migrate on the same host: between two
> > > > > > > > vms on local host or between two vms in the remote host; but when
> > > > > > > > migrating from local to remote, the guest hangs. I still can access its
> > > > > > > > console via ctrl+alt+2, though, and everything seems to be normal. If I
> > > > > > > > issue a reboot via console on the remote, the guest gets back to
> > > > > > > > normal.
> > > > > > > >
> > > > > > > > Am I missing something here?
> > > > > > >
> > > > > > > Just checking, but are you saying that as far as qemu is concerned, the migration
> > > > > > > is happy, it's just the guest that's hung?
> > > > > >
> > > > > > That's exactly the case. The console (via ctrl+alt+2) is active and
> > > > > > responding to all commands normally, but the screen (ctrl+alt+1) is
> > > > > > frozen and I can't interact with it at all.
> > > > >
> > > > > Are you driving this via libvirt or using qemu monitor directly?
> > > > > If the latter, can you please get an 'info migrate' from the source
> > > > > and an 'info status' from the destination at the end of migrate.
> > > >
> > > > I'm using qemu command line directly. And I got the problem :) See
> > > > below.
> > > >
> > > > >
> > > > > > > Are the host clocks on the two hosts very close (there are lots of
> > > > > > > weird corner cases with mismatched clocks) - same time zone?
> > > > > >
> > > > > > Yep. Both machines are in the same room and have the clock sync'ed.
> > > > >
> > > > > OK, good.
> > > > >
> > > > > > >
> > > > > > > Are you using cache=none (given that it's NFS shared)
> > > > > >
> > > > > > I wasn't. But I tried again with cache=none and I got exactly the same
> > > > > > thing.
> > > > >
> > > > > OK, and this pair of machines, have you tried both directions - i.e.
> > > > > going a->b and b->a - do both directions fail?
> > > > > Is the NFS server one of the two machines? If it is, and you're using libvirt,
> > > > > make sure that the directory the disks are on is an NFS mount on both
> > > > > machines; e.g. don't migrate directly from the NFS export.
> > > > >
> > > > > > Also, I tried with stable-2.2 branch and got the same behavior. I really
> > > > > > think that's very unlikely to have unstable code of such an important
> > > > > > feature upstream, or on a stable- branch. Most probable thing is that
> > > > > > I have something wrong on my environment.
> > > > >
> > > > > Yes, the challenge is to find what; and if it's something common
> > > > > we should try and find a way of spotting it.
> > > > >
> > > > > > Anyway, I'll keep testing different stable- branches until I find
> > > > > > something that works for me. I'll keep the mailing list posted.
> > > > >
> > > > > Could you share the qemu command line so we can see if we can
> > > > > spot anything?
> > > >
> > > > Got the problem!
> > > > I tried to simplify my qemu command line to the
> > > > smallest possible, excluding things I thought could cause the issue.
> > > > Without further ado, this is the argument:
> > > >
> > > >     -cpu 'Opteron_G4'
> > > >
> > > > Without this argument everything works as it should, console responsive
> > > > and guest active :)
> > >
> > > Can you show cat /proc/cpuinfo off the two hosts?
> > > (Only one CPU, but please include the whole entry)
> >
> > Intel host:
> > processor : 7
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 60
> > model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> > stepping : 3
> > microcode : 0x1c
> > cpu MHz : 883.468
> > cache size : 8192 KB
> > physical id : 0
> > siblings : 8
> > core id : 3
> > cpu cores : 4
> > apicid : 7
> > initial apicid : 7
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 13
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
> > bugs :
> > bogomips : 6784.87
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 39 bits physical, 48 bits virtual
> > power management:
> >
> > AMD host:
> > processor : 5
> > vendor_id : AuthenticAMD
> > cpu family : 16
> > model : 10
> > model name : AMD Phenom(tm) II X6 1075T Processor
> > stepping : 0
> > microcode : 0x10000bf
> > cpu MHz : 800.000
> > cache size : 512 KB
> > physical id : 0
> > siblings : 6
> > core id : 5
> > cpu cores : 6
> > apicid : 5
> > initial apicid : 5
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 6
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
> > bogomips : 6027.25
> > TLB size : 1024 4K pages
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm stc 100mhzsteps hwpstate cpb
>
> OK, very different CPUs. My guess is that one or both of them don't support
> some feature of the Opteron_G4. When specifying -cpu it's often best
> to use the enforce option.
>
> What happens if you try:
>
> qemu-system-x86_64 -machine pc,accel=kvm -cpu Opteron_G4,enforce=on -nographic

This is the script I'm using right now on both hosts:

    otubo@vader ~ # cat startvm.sh
    #!/bin/bash

    /home/otubo/develop/qemu/github/x86_64-softmmu/qemu-system-x86_64 \
        -machine pc,accel=kvm -cpu Opteron_G4,enforce=on \
        -name 'virt-tests-vm1' \
        -sandbox off \
        -display sdl \
        -drive id=drive_image1,cache=none,if=none,file=$1 \
        -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=04 \
        -device virtio-net-pci,mac=9a:22:23:24:25:26,id=idqE7Ggl,vectors=4,netdev=idjYAneH,bus=pci.0,addr=05 \
        -netdev user,id=idjYAneH,hostfwd=tcp::5001-:22 \
        -m 2G,slots=32,maxmem=10G \
        -smp 2,maxcpus=10,cores=1,threads=1,sockets=2 \
        -boot order=cdn,once=c,menu=off \
        -enable-kvm

> on both hosts?
The output follows, Intel host:

    otubo@vader ~ # ./startvm.sh /media/virt_images/pb-debian-7-server-latest.qcow2
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.sse4a [bit 6]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.misalignsse [bit 7]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.3dnowprefetch [bit 8]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.xop [bit 11]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.fma4 [bit 16]
    qemu-system-x86_64: Host doesn't support requested features

AMD host:

    otubo@qemu-kvm-testrunner [2015-07-29 14:41:40] ~ # ./startvm-incoming.sh /media/virt_images/pb-debian-7-server-latest.qcow2
    warning: host doesn't support requested feature: CPUID.01H:ECX.pclmulqdq|pclmuldq [bit 1]
    warning: host doesn't support requested feature: CPUID.01H:ECX.ssse3 [bit 9]
    warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.1|sse4_1 [bit 19]
    warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.2|sse4_2 [bit 20]
    warning: host doesn't support requested feature: CPUID.01H:ECX.aes [bit 25]
    warning: host doesn't support requested feature: CPUID.01H:ECX.xsave [bit 26]
    warning: host doesn't support requested feature: CPUID.01H:ECX.avx [bit 28]
    warning: host doesn't support requested feature: CPUID.80000001H:EDX.rdtscp [bit 27]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.xop [bit 11]
    warning: host doesn't support requested feature: CPUID.80000001H:ECX.fma4 [bit 16]
    qemu-system-x86_64: Host doesn't support requested features

> You need to pick a CPU option that works with that on both of the hosts.
>

So you think it's just a matter of fine-tuning which CPU option is best
for live migration on each platform? Or should it be handled inside
Qemu itself?
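To make the "works on both hosts" requirement concrete: a minimal
sketch, assuming the text after "flags :" from each host's
/proc/cpuinfo has been copied into a shell variable, of how the
features common to both hosts could be listed. The function name
common_flags and the variable names in the comment are hypothetical
and not part of any QEMU tooling.

```shell
# Print every CPU flag from the first list that also appears in the
# second list, one flag per line.
# $1 and $2 are space-separated flag lists, e.g. obtained on each host with
#   grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
common_flags() {
    # $1 is left unquoted on purpose so the shell splits it into words.
    for f in $1; do
        # Pad with spaces so "sse" cannot match inside "sse4a".
        case " $2 " in
            *" $f "*) printf '%s\n' "$f" ;;
        esac
    done
}

# Example call (intel_flags and amd_flags are hypothetical variables):
#   common_flags "$intel_flags" "$amd_flags"
```

With the two flag lists quoted earlier in this thread, the result
would leave out sse4a (AMD host only) and avx (Intel host only), which
agrees with the warnings printed above. /proc/cpuinfo and QEMU do not
always spell a feature the same way (the warnings show both sse4.1 and
sse4_1), so the list is only a starting point for choosing a -cpu
model, not a ready-made argument.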
Regards,

-- 
Eduardo Otubo
ProfitBricks GmbH

--5mCyUwZo2JvN/JJP--