From: Dario Faggioli
Subject: Re: Scheduler regression in 4.7
Date: Thu, 11 Aug 2016 16:28:12 +0200
Message-ID: <1470925692.6250.20.camel@citrix.com>
References: <0ebca08f-71bd-7b96-9182-caa66e4f370f@citrix.com> <07d7f503-f4e2-8955-0470-55ecbb532b88@citrix.com>
To: Andrew Cooper, George Dunlap, George Dunlap
Cc: Jan Beulich, Xen-devel List
List-Id: xen-devel@lists.xenproject.org

On Thu, 2016-08-11 at 14:39 +0100, Andrew Cooper wrote:
> On 11/08/16 14:24, George Dunlap wrote:
> > On 11/08/16 12:35, Andrew Cooper wrote:
> > > The actual cause is _csched_cpu_pick() falling over LIST_POISON,
> > > which happened to occur at the same time as a domain was shutting
> > > down.  The instruction in question is `mov 0x10(%rax),%rax` which
> > > looks like reverse list traversal.
> > Thanks for the report.
> >
> > Could you use line2addr or objdump -dl to get a better idea where
> > the #GP is happening?
> addr2line -e xen-syms-4.7.0-xs127493 ffff82d08012944f
> /obj/RPM_BUILD_DIRECTORY/xen-4.7.0/xen/common/sched_credit.c:775
> (discriminator 1)
>
> It will be IS_RUNQ_IDLE() which is the problem.
>
OK, that does one step of list traversal (on the runq). What I didn't
understand from your report is what crashed, and when. IS_RUNQ_IDLE()
was introduced a while back, and nothing like this has ever been caught
so far.

George's patch makes _csched_cpu_pick() be called during
insert_vcpu()-->csched_vcpu_insert() which, in 4.7, is called:
 1) during domain (well, vcpu) creation,
 2) when a domain is moved among cpupools.

AFAICR, during domain destruction we basically move the domain to
cpupool0 and, without a patch that I sent recently, that is always done
as a full-fledged cpupool movement, even if the domain is _already_ in
cpupool0. So, even if you are not using cpupools, since you mention
domain shutdown we are probably looking at 2).

But this is what I'm not sure I got right... Do you have enough info to
tell precisely when the crash manifests? Is it indeed during a domain
shutdown, or was it during a domain creation (sched_init_vcpu() is in
the stack trace... although I've read it's a non-debug one)? And is it
a 'regular' domain or dom0 that is shutting down/coming up?

The idea behind IS_RUNQ_IDLE() is that we need to know whether there is
someone in the runq of a cpu or not, to correctly initialize --and
hence avoid biasing-- some load balancing calculations. I've never
liked the idea (let alone the code), but it's necessary (or, at least,
I don't see a sensible alternative).

The questions I'm asking above are aimed at figuring out what the
status of the runq could be, and why adding a call to csched_cpu_pick()
from insert_vcpu() is making things explode...
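
To make the runq check discussed above a bit more concrete, here is a
minimal standalone sketch of that kind of test: "the runq is idle if it
is empty, or if the element at its head is the idle vcpu". This is not
the Xen code. All names (sketch_vcpu, runq_is_idle(), list_del_poison(),
the SKETCH_POISON values) are made up for illustration, and the list is
a simplified stand-in for Xen's list_head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified doubly linked list node, in the style of list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical poison values for this sketch only. */
#define SKETCH_POISON1 ((struct list_head *)(uintptr_t)0x100100)
#define SKETCH_POISON2 ((struct list_head *)(uintptr_t)0x200200)

void list_init(struct list_head *h) { h->next = h->prev = h; }

bool list_is_empty(const struct list_head *h) { return h->next == h; }

/* Insert n right after the list head h. */
void list_add_head(struct list_head *n, struct list_head *h)
{
    assert(n != NULL && h != NULL);
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

/* Unlink e and poison its links, so that later use of e stands out. */
void list_del_poison(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = SKETCH_POISON1;
    e->prev = SKETCH_POISON2;
}

/* Hypothetical, stripped-down vcpu with an embedded runq node. */
struct sketch_vcpu {
    struct list_head runq_elem;
    bool is_idle;
};

/* Recover the vcpu from its embedded list node (container_of idea). */
struct sketch_vcpu *runq_entry(struct list_head *e)
{
    return (struct sketch_vcpu *)((char *)e -
                                  offsetof(struct sketch_vcpu, runq_elem));
}

/* One step of list traversal: look only at the element at the head. */
bool runq_is_idle(struct list_head *runq)
{
    return list_is_empty(runq) || runq_entry(runq->next)->is_idle;
}

/* Debug helper: do this node's links carry the poison values? */
bool list_is_poisoned(const struct list_head *e)
{
    return e->next == SKETCH_POISON1 || e->prev == SKETCH_POISON2;
}
```

In this sketch, runq_is_idle() is only safe while runq->next points at a
live element. If the pointer it follows came from an already deleted
(poisoned) node, the dereference in runq_entry(...)->is_idle would hit a
bogus address. Whether something like that is what happens in the
reported crash, and on which path, is exactly what the questions above
try to establish.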

Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)