From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from gwu.lbox.cz ([62.245.111.132]:53950 "EHLO gwu.lbox.cz"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S934880AbeCGU3T (ORCPT ); Wed, 7 Mar 2018 15:29:19 -0500
Date: Wed, 7 Mar 2018 21:29:10 +0100
From: Nikola Ciprich
To: 王金浦
Cc: KVM list, nik@linuxbox.cz, stable@vger.kernel.org
Subject: Re: 4.14.18 -> 4.14.24 - almost all guests hanged
Message-ID: <20180307202910.GA1527@localhost.localdomain>
References: <20180305083606.GA3004@pcnci.linuxbox.cz>
	<20180307145623.GH28488@pcnci.linuxbox.cz>
In-Reply-To: <20180307145623.GH28488@pcnci.linuxbox.cz>
Sender: stable-owner@vger.kernel.org

Hi,

> > > I'd like to report that when upgrading our cluster from 4.14.18 to
> > > 4.14.24-rc1 (with live guests migration), almost none of guests survived..
> > What's your hardware setup, intel with IBPB enabled microcode?
> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
>
> therefore I suppose no IBPB (at least meltdown checker reports so)
>
>
> > Does guests hang right after live migration?
> yes, just tried it.
>
>
> >
> > Are you able to reproduce the problem, does it work with latest upstream?
> yup, so I'm able to reproduce quickly. I'll revert the cluster to 4.14.18 now,
> but setup test system just afterwards, so and test the patch you've proposed.
>
> >
> > Not sure it helps, but following patch is missing in 4.14.24
> >
> > commit 37b95951c58fdf08dc10afa9d02066ed9f176fb5 upstream.
> >
> > kvm_valid_sregs() should use X86_CR0_PG and X86_CR4_PAE to check bit
> > status rather than X86_CR0_PG_BIT and X86_CR4_PAE_BIT. This patch is
> > to fix it.
> >
> > Fixes: f29810335965a(KVM/x86: Check input paging mode when cs.l is set)
> > Reported-by: Jeremi Piotrowski
> > Cc: Paolo Bonzini
> > Cc: Radim Krčmář
> > Signed-off-by: Tianyu Lan
> > Signed-off-by: Radim Krčmář
>
> I'll test and report.

so indeed, this one on top of 4.14.24-rc1 fixes the migration for me.

Greg, could you queue this one up please?

Jack, thanks for the hint!

BR

nik

>
> n.
>
>
> >
> > Regards,
> > Jack
> > >
> > > I noticed that most of them got stuck in "paused" state without
> > > possibility to resume (virsh just reported guest cannot be continued and
> > > needs to be rebooted).
> > >
> > > in dmesg, lots of following messages appeared:
> > >
> > > [ 116.593508] device vnet0 entered promiscuous mode
> > > [ 124.143532] *** Guest State ***
> > > [ 124.143594] CR0: actual=0x0000000000000030, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
> > > [ 124.143668] CR4: actual=0x0000000000002050, shadow=0x0000000000000000, gh_mask=ffffffffffffe871
> > > [ 124.143871] CR3 = 0x00000000feffc000
> > > [ 124.143984] RSP = 0xffffffff82003e98 RIP = 0xffffffff816df002
> > > [ 124.144102] RFLAGS=0x00000246 DR7 = 0x0000000000000400
> > > [ 124.144221] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
> > > [ 124.144341] CS: sel=0xf000, attr=0x0009b, limit=0x0000ffff, base=0x00000000ffff0000
> > > [ 124.144516] DS: sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.144692] SS: sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.144907] ES: sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.145089] FS: sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.145272] GS: sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.145447] GDTR: limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.145626] LDTR: sel=0x0000, attr=0x00082, limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.145814] IDTR: limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.145995] TR: sel=0x0000, attr=0x0008b, limit=0x0000ffff, base=0x0000000000000000
> > > [ 124.146173] EFER = 0x0000000000000000 PAT = 0x0007040600070406
> > > [ 124.146292] DebugCtl = 0x0000000000000000 DebugExceptions = 0x0000000000000000
> > > [ 124.146466] Interruptibility = 00000000 ActivityState = 00000000
> > > [ 124.146579] *** Host State ***
> > > [ 124.146687] RIP = 0xffffffffa046a817 RSP = 0xffffc900200a7cb8
> > > [ 124.146832] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
> > > [ 124.146961] FSBase=00007fe82eff7700 GSBase=ffff881fffb40000 TRBase=fffffe00000df000
> > > [ 124.147144] GDTBase=fffffe00000dd000 IDTBase=fffffe0000000000
> > > [ 124.147262] CR0=0000000080050033 CR3=0000001f5b8fe004 CR4=00000000000626e0
> > > [ 124.147381] Sysenter RSP=fffffe00000de200 CS:RIP=0010:ffffffff81801f60
> > > [ 124.147499] EFER = 0x0000000000000d01 PAT = 0x0407050600070106
> > > [ 124.147614] *** Control State ***
> > > [ 124.147734] PinBased=0000007f CPUBased=96a1e9fa SecondaryExec=000004f2
> > > [ 124.147849] EntryControls=0000d1ff ExitControls=002fefff
> > > [ 124.147965] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
> > > [ 124.148085] VMEntry: intr_info=80000081 errcode=00000000 ilen=00000000
> > > [ 124.148201] VMExit: intr_info=00000000 errcode=00000000 ilen=00000000
> > > [ 124.148318] reason=80000021 qualification=0000000000000000
> > > [ 124.148432] IDTVectoring: info=00000000 errcode=00000000
> > > [ 124.148545] TSC Offset = 0xffed7296fb06bc34
> > > [ 124.148655] TPR Threshold = 0x00
> > > [ 124.148770] EPT pointer = 0x0000001f1a0af01e
> > > [ 124.148882] PLE Gap=00000080 Window=00001000
> > > [ 124.148995] Virtual processor ID = 0x0001
> > >
> > > (never seen anything like that)
> > >
> > > I haven't yet went through all patches between those two versions, so don't
> > > have any suspicion yet.. If anyone recognizes this as known problem, please
> > > let me know..
> > >
> > > I'm going to try whether I'm able to reproduce the problem.
> > >
> > > BR
> > >
> > > nik
> >
>
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.: +420 591 166 214
> fax: +420 596 621 273
> mobil: +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: servis@linuxbox.cz
> -------------------------------------
>

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------
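
A note on the fix quoted in this thread: for each control-register flag there is a bit position (the *_BIT name) and a corresponding mask, and only the mask gives the intended result in a bitwise AND. Below is a minimal Python sketch of that difference; it is an illustration, not the kernel code. The constant values follow the x86 architecture (CR0.PG is bit 31, CR4.PAE is bit 5). The helper names flag_set and flag_set_bit_index are hypothetical, and the sample register values are simply the CR0 and CR4 from the "Host State" part of the dump, used here only as sample input.

```python
# Bit positions and the masks derived from them
# (x86: CR0.PG is bit 31, CR4.PAE is bit 5).
X86_CR0_PG_BIT = 31
X86_CR0_PG = 1 << X86_CR0_PG_BIT
X86_CR4_PAE_BIT = 5
X86_CR4_PAE = 1 << X86_CR4_PAE_BIT


def flag_set(reg: int, mask: int) -> bool:
    """Intended check: AND the register with the flag's mask."""
    return bool(reg & mask)


def flag_set_bit_index(reg: int, bit: int) -> bool:
    """Mistaken check: AND the register with the bit *position*.

    Hypothetical helper that mimics using X86_CR4_PAE_BIT where
    X86_CR4_PAE was meant; it inspects unrelated low bits.
    """
    return bool(reg & bit)


# Sample values only (taken from the Host State lines of the dump).
SAMPLE_CR0 = 0x80050033
SAMPLE_CR4 = 0x000626E0

if __name__ == "__main__":
    print("CR0.PG  via mask     :", flag_set(SAMPLE_CR0, X86_CR0_PG))
    print("CR0.PG  via bit index:", flag_set_bit_index(SAMPLE_CR0, X86_CR0_PG_BIT))
    print("CR4.PAE via mask     :", flag_set(SAMPLE_CR4, X86_CR4_PAE))
    print("CR4.PAE via bit index:", flag_set_bit_index(SAMPLE_CR4, X86_CR4_PAE_BIT))
```

With these inputs the mask check reports PAE as set, while the bit-index check reports it as clear, because ANDing with 5 looks at bits 0 and 2 rather than bit 5. For CR0 the two happen to agree, since ANDing with 31 picks up unrelated low bits that are set. A check that misreads PAE this way could reject otherwise valid register state, which would be consistent with the symptom reported here, although the thread itself does not spell out the mechanism.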