From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: some thoughts about merlot{0|1} issues [was: Re: [xen-unstable test] 102522: tolerable FAIL - PUSHED] Date: Thu, 24 Nov 2016 16:14:23 +0100 Message-ID: <1480000463.2712.150.camel@citrix.com> References: Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============6245045683895187318==" Return-path: In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" To: xen-devel@lists.xensource.com Cc: Boris Ostrovsky , Ian Jackson , Wei Liu , Jan Beulich List-Id: xen-devel@lists.xenproject.org --===============6245045683895187318== Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="=-zT3fnojLpZaRcz2q3AA3" --=-zT3fnojLpZaRcz2q3AA3 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Wed, 2016-11-23 at 15:54 +0000, osstest service owner wrote: > flight 102522 xen-unstable real [real] > http://logs.test-lab.xenproject.org/osstest/logs/102522/ >=20 > Regressions which are regarded as allowable (not blocking): > =C2=A0test-amd64-amd64-xl-rtds=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A09 debia= n- > install=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0fail=C2=A0=C2=A0like 102465 >=20 This is on merlot1, and as far as I can tell, this test is failing there since quite a while (is that the correct interpretation of this table?): http://logs.test-lab.xenproject.org/osstest/results/history/test-amd64-amd6= 4-xl-rtds/xen-unstable This is using RTDS as scheduler, but that should not be the problem. In fact, what's failing is xen-create-image timing out. Basically, it starts creating the VM filesystem via debootstrap, but does not manage to finish doing that within the 2530 secs timeout: http://logs.test-lab.xenproject.org/osstest/logs/102522/test-amd64-amd64-xl= -rtds/9.ts-debian-install.log 2016-11-23 13:12:35 Z command timed out [2500]: timeout 2530 ssh -o StrictH= ostKeyChecking=3Dno -o BatchMode=3Dyes -o ConnectTimeout=3D100 -o ServerAli= veInterval=3D100 -o PasswordAuthentication=3Dno -o ChallengeResponseAuthent= ication=3Dno -o UserKnownHostsFile=3Dtmp/t.known_hosts_102522.test-amd64-am= d64-xl-rtds root@172.16.144.21 http_proxy=3Dhttp://cache:3143/ \ xen-create-image \ In other runs on different hosts, still under RTDS, that takes about 650 seconds: http://logs.test-lab.xenproject.org/osstest/logs/102532/test-amd64-amd64-xl= -rtds/9.ts-debian-install.log And I've tried myself on my test box, and it took 10m20s. We know from here, that, this time, it got stuck rather early: http://logs.test-lab.xenproject.org/osstest/logs/102522/test-amd64-amd64-xl= -rtds/merlot1---var-log-xen-tools-debian.guest.osstest.log I've looked at a handful of other instances, and it seems to be _always_ like that. The system appears alive though, or at least right after the timeout, the ts-log-capture phase --which includes issuing commands on the host and copying files from there-- succeeds. Also, not sure it means much, but xen-create-image starts at 12:30:55=C2=A0and times out at=C2=A013:12:35. Looking in: http://logs.test-lab.xenproject.org/osstest/logs/102522/test-amd64-amd64-xl= -rtds/merlot1---var-log-daemon.log we see: Nov 23 12:04:55 merlot1 ntpd[3003]: Listening on routing socket on fd #22 f= or interface updates [..] Nov 23 12:27:22 merlot1 init: Id "T0" respawning too fast: disabled for 5 m= inutes Nov 23 12:34:04 merlot1 init: Id "T0" respawning too fast: disabled for 5 m= inutes Nov 23 12:40:47 merlot1 init: Id "T0" respawning too fast: disabled for 5 m= inutes Nov 23 12:47:31 merlot1 init: Id "T0" respawning too fast: disabled for 5 m= inutes Nov 23 12:54:14 merlot1 init: Id "T0" respawning too fast: disabled for 5 m= inutes Nov 23 13:00:57 merlot1 init: Id "T0" respawning too fast: disabled for 5 m= inutes Nov 23 13:07:40 merlot1 init: Id "T0" respawning too fast: disabled for 5 m= inutes So, again, at least something is alive on the host, and writing in the logs, even during the time that debootstrap seems stuck. There are 2 running vcpus: Name ID VCPU CPU State Time(s) Affinity= (Hard / Soft) Domain-0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00=C2=A0=C2=A0=C2=A0=C2=A0=C2=A01=C2=A0=C2= =A0=C2=A0=C2=A02=C2=A0=C2=A0=C2=A0r--=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A035= .7=C2=A0=C2=A0all / all Domain-0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00=C2=A0=C2=A0=C2=A0=C2=A017=C2=A0=C2=A0=C2= =A0=C2=A09=C2=A0=C2=A0=C2=A0r--=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A038.3=C2= =A0=C2=A0all / all vcpu 17 is running on CPU 9 which is on node 1, which has _0_ memory: node: memsize memfree distances 0: 9216 7856 10,16,16,16 1: 0 0 16,10,16,16 2: 8175 7779 16,16,10,16 3: 0 0 16,16,16,10 (see here: http://logs.test-lab.xenproject.org/osstest/logs/102522/test-amd= 64-amd64-xl-rtds/merlot1-output-xl_info_-n ) but that should not mean much, and in other, still failing, runs, this does not happen (also, this is just what is going on while we collect logs). Now, about serial output: http://logs.test-lab.xenproject.org/osstest/logs/102522/test-amd64-amd64-xl= -rtds/serial-merlot1.log When dumping ACPI C states, here's how things look like for _all_ CPUs: Nov 23 13:13:00.382134 (XEN) =3D=3Dcpu3=3D=3D Nov 23 13:13:00.382157 (XEN) active state: C-1 Nov 23 13:13:00.390096 (XEN) max_cstate: C7 Nov 23 13:13:00.390125 (XEN) states: Nov 23 13:13:00.390148 (XEN)=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0C1: type[C1] late= ncy[000] usage[00000000] method[ HALT] duration[0] Nov 23 13:13:00.398055 (XEN)=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0C0: usage[0000000= 0] duration[4229118701384] Nov 23 13:13:00.398090 (XEN) PC2[0] PC3[0] PC6[0] PC7[0] Nov 23 13:13:00.406088 (XEN) CC3[0] CC6[0] CC7[0] And I checked other runs, and it's the same everywhere. I remember that Jan suggested trying to pass max_cstate=3D1 to Xen at boot. I was about to ask Ian to do that for this host, but it looks like we're using only C0 and C1 already anyway. Boot command line looks like this: xen_commandline=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0: placeholde= r conswitch=3Dx watchdog com1=3D115200,8n1 console=3Dcom1,vga gdb=3Dcom1 do= m0_mem=3D512M,max:512M ucode=3Dscan sched=3Drtds which makes the above look a bit weird to me... But I've played much more with Intel boxes than with AMD ones, I admit. For now, I'm done. At some point, I'll recall either merlot0 or merlot1 out of OSSTest, take it back for myself, and try to investigate more. If, it in the meantime, any of this rings a bell for anyone, feel free to speak up. Regards, Dario --=20 <> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) --=-zT3fnojLpZaRcz2q3AA3 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJYNwPPAAoJEBZCeImluHPuz+4QANvv6ae2/987lA7Ur2ukP4JO 7PFdYA7m+t3+hAlLH0EAi8ScYWvWzNiwJDQnwIxgdf53edU90NSy+1M4v4C7MNK9 s2er6d8fXCDyJZI35qJ0iDijbonUqbUEPaxXWu5Kva6kWVY5k4ExpGKushFAAizf wfkmlCq96Ba1Ac+IC/NZGo2+muZTF73jOLSl+RVUw7pR9pu2AGa8Ouvi7noxx2SK UcOhvks8Y11ihrQveEdqnzC/MS5zvRGxrEKeKmCX6z38v6mPwD2wmkmQpTJjTRpv I6qsUYrsYQsyztrM1jEthxCKKhM6UvEpfN0+BVh0qN9WfQBb/wDFnT+xZfgEu4c9 wmw5lznE8ZHYl+RV1i1AF/jRIOXz9qbaE5WAdFypVnP/tBIWC1RkoLiQghBbE/O0 PsQVGW73NF59LXL2WHRPVo3tNLa7142WzOWcSqd8tCt2ygkJOtK4IMbpaKvSsc/g cm40pAHP+4BsoW6PW2UboygY4KjblAoSBX33jIhdfZ9U90xSq+u7aLfpTg8CSkfB rhmYjYl0bhP1Sn2wpIKDFkNmlVdI3IppfUWjUMtrVC4fSOThP5EzUW/OcYW18JyZ tzQX5AyEv7EVDz0P7hgWoxthH4UebSkNDAxefNb7YzEdfOMA9m0bZbPW/8BOw4cL arTJSmVWSLIWHL0fhtC1 =+0XH -----END PGP SIGNATURE----- --=-zT3fnojLpZaRcz2q3AA3-- --===============6245045683895187318== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwczovL2xpc3RzLnhlbi5v cmcveGVuLWRldmVsCg== --===============6245045683895187318==--