From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: Re: [PATCH v2] sched: credit2: respect per-vcpu hard affinity Date: Mon, 9 Mar 2015 15:07:37 +0000 Message-ID: <1425913651.2729.87.camel@citrix.com> References: <1423453550-3526-1-git-send-email-jtweaver@hawaii.edu> <1425352508.10194.241.camel@citrix.com> <54F9C55F.5070608@eu.citrix.com> <1425661338.16932.52.camel@citrix.com> <54FD87BD.8000806@eu.citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1776275049535943026==" Return-path: In-Reply-To: <54FD87BD.8000806@eu.citrix.com> Content-Language: en-US List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: George Dunlap Cc: "xen-devel@lists.xen.org" , "jtweaver@hawaii.edu" , "JBeulich@suse.com" , "henric@hawaii.edu" List-Id: xen-devel@lists.xenproject.org --===============1776275049535943026== Content-Language: en-US Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=-1ifxGyOu52BM/haD41Zn" --=-1ifxGyOu52BM/haD41Zn Content-Type: multipart/mixed; boundary="=-OrzIlTABTSsbabFyjZRm" --=-OrzIlTABTSsbabFyjZRm Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, 2015-03-09 at 11:45 +0000, George Dunlap wrote: > On 03/09/2015 07:11 AM, Justin Weaver wrote: > > I don't think there's any way the mask can be empty after the > > cpumask_and with the cpu_hard_affinity and the VCPU2ONLINE. In > > schedule.c:vcpu_set_hard_affinity, if the intersection of the new hard > > mask and VCPU2ONLINE is empty, the function returns -EINVAL. I took > > into account all the discussion here and modified the function for v3. >=20 > What about this: >=20 > * Create a cpu pool with cpus 0 and 1. online_cpus is now [0,1]. > * Set a hard affinity of [1]. This succeeds. > So, I decided to try this: # xl cpupool-list -c Name CPU list Pool-0 0,1 # xl list -c Name ID Mem VCPUs State Time(s) = Cpupool Domain-0 0 507 4 r----- = 19.9 Pool0 test-pv 1 2000 8 -b---- = 19.2 Pool0 # xl vcpu-list test-pv Name ID VCPU CPU State Time(s) Affinity= (Hard / Soft) test-pv 1 0 1 -b- 5.5 1 / 1 test-pv 1 1 1 -b- 2.4 1 / 1 test-pv 1 2 1 -b- 2.9 1 / 1 test-pv 1 3 1 -b- 1.9 1 / 1 test-pv 1 4 1 -b- 1.0 1 / 1 test-pv 1 5 1 -b- 2.0 1 / 1 test-pv 1 6 1 -b- 1.2 1 / 1 test-pv 1 7 1 -b- 2.4 1 / 1 # xl cpupool-cpu-remove Pool0 1 (XEN) **************************************** (XEN) Panic on CPU 0: (XEN) Assertion '!cpumask_empty(dom_cpumask)' failed at domain.c:460 (XEN) **************************************** i.e., surprise surprise, there's already an assert guarding this... and it triggers!! :-( It seems to have been added in v4 of the per-vcpu soft affinity work: http://xenbits.xen.org/gitweb/?p=3Dpeople/dariof/xen.git;a=3Dcommit;h=3D4d4= e999637f38e0bbd4318ad8e0c92fd1e580430 So, we must have had this discussion before! :-) I've done a bit of archeology and the ASSERT() in domain_update_node_affinity() was introduced by me (upon request) while working on implementing per-vcpu soft affinity. Then, cs 93be8285 is what causes the assert to trigger. Time seems not to match, but that's because soft affinity was put on hold when it was decided not to include it in 4.4, and I probably forgot to retest with a use case similar to the above when resubmitting it! :-( A patch to fix things is attached to this email, for convenience (I'll submit it properly, with changelog and everything, right away). After seeing this, I'm even more convinced that !empty(online && hard_affinity) is really something we want and, as we ASSERT() it in domain.c, the same should be done in sched_credit2.c. OTOH, if we don't want to ASSERT() in the specific scheduler code, then I'd remove the one in domain.c too (and, perhaps, just use online as a fallback), or things would look inconsistet. Of course, this can also be seen as a proof of George's point, that this is something not obvious and not easy to catch. Still, I think that if we say that we support hard vcpu affinity (a.k.a. pinning), we should not allow vcpus outside their hard affinity, and the code must reflect this. > * Move cpu 1 to a different cpupool. > > After a quick look I don't see anything that updates the hard affinity > when cpus are removed from pools. >=20 cpupool_unassign_cpu() calls cpupool_unassign_cpu_helper()that calls cpu_disable_scheduler() which, if it finds that empty(online && hard_affinity)=3D=3Dtrue, it resets hard_affinity to "all". Note that this is reported in the log, as a confirmation that this is really something exceptional: (XEN) Breaking affinity for d1v0 (XEN) Breaking affinity for d1v1 (XEN) Breaking affinity for d1v2 (XEN) Breaking affinity for d1v3 (XEN) Breaking affinity for d1v4 (XEN) Breaking affinity for d1v5 (XEN) Breaking affinity for d1v6 (XEN) Breaking affinity for d1v7 And also note that, as a consequence of fiddling with cpupools, the affinity has been altered by Xen (i.e., vcpus still always run within their hard affinity masks): # xl vcpu-pin 1 all 1 1 # xl cpupool-cpu-remove Pool-0 1 # xl cpupool-list -c Name CPU list Pool-0 0,2,3,4,5,6,7,8,9,10,11,12,13,14,15 # xl vcpu-list test-pv Name ID VCPU CPU State Time(s) Affinity= (Hard / Soft) test-pv 1 0 2 -b- 6.1 all / 1 test-pv 1 1 6 -b- 1.6 all / 1 test-pv 1 2 4 -b- 1.6 all / 1 test-pv 1 3 5 -b- 3.1 all / 1 test-pv 1 4 4 -b- 0.8 all / 1 test-pv 1 5 7 -b- 3.0 all / 1 test-pv 1 6 3 -b- 1.1 all / 1 test-pv 1 7 3 -b- 0.7 all / 1 That's what drove all the reasoning when changing domain_update_node_affinity(), and led to that ASSERT(), during the soft affinity work. The reason why the ASSERT() triggers, as can be seen in the patch, is that, because of the cited changeset, domain_update_node_affinity() is called too early. > And, even if it does *now*, it's possible that something might be > changed in the future that would forget it; this ASSERT() isn't exactly > next to that code. >=20 But it would help us catch the bug at some point... As proved above! :-D BTW, I've already written and submitted an OSSTest test involving cpupools (we don't do anything at the moment). I'll see about adding these kind of things to it. > So it seems to me like handling that case makes the software > more robust for little cost. >=20 Yep, I also agree that it at least won't cost much, in terms of runtime overhead. Regards, Dario --=-OrzIlTABTSsbabFyjZRm Content-Disposition: attachment; filename="patch" Content-Transfer-Encoding: base64 Content-Type: text/plain; name="patch"; charset="UTF-8" Y29tbWl0IDE0NDMwZDI1NDQ4YWE4NWRmN2NlZjk5MWIwZjUyZGU5MWUzNzQ0MGIKQXV0aG9yOiBE YXJpbyBGYWdnaW9saSA8ZGFyaW8uZmFnZ2lvbGlAY2l0cml4LmNvbT4KRGF0ZTogICBNb24gTWFy IDkgMTU6MjY6MzAgMjAxNSArMDEwMAoKICAgIFNpZ25lZC1vZmYtYnk6IERhcmlvIEZhZ2dpb2xp IDxkYXJpby5mYWdnaW9saUBjaXRyaXguY29tPgoKZGlmZiAtLWdpdCBhL3hlbi9jb21tb24vY3B1 cG9vbC5jIGIveGVuL2NvbW1vbi9jcHVwb29sLmMKaW5kZXggYTc1OGE4Yi4uYWQ3YmJiNSAxMDA2 NDQKLS0tIGEveGVuL2NvbW1vbi9jcHVwb29sLmMKKysrIGIveGVuL2NvbW1vbi9jcHVwb29sLmMK QEAgLTI5Niw2ICsyOTYsOCBAQCBzdGF0aWMgaW50IGNwdXBvb2xfYXNzaWduX2NwdV9sb2NrZWQo c3RydWN0IGNwdXBvb2wgKmMsIHVuc2lnbmVkIGludCBjcHUpCiBzdGF0aWMgbG9uZyBjcHVwb29s X3VuYXNzaWduX2NwdV9oZWxwZXIodm9pZCAqaW5mbykKIHsKICAgICBpbnQgY3B1ID0gY3B1cG9v bF9tb3ZpbmdfY3B1OworICAgIHN0cnVjdCBjcHVwb29sICpjID0gaW5mbzsKKyAgICBzdHJ1Y3Qg ZG9tYWluICpkOwogICAgIGxvbmcgcmV0OwogCiAgICAgY3B1cG9vbF9kcHJpbnRrKCJjcHVwb29s X3VuYXNzaWduX2NwdShwb29sPSVkLGNwdT0lZClcbiIsCkBAIC0zMTgsNiArMzIwLDExIEBAIHN0 YXRpYyBsb25nIGNwdXBvb2xfdW5hc3NpZ25fY3B1X2hlbHBlcih2b2lkICppbmZvKQogICAgICAg ICBjcHVwb29sX2NwdV9tb3ZpbmcgPSBOVUxMOwogICAgIH0KIAorICAgIHJjdV9yZWFkX2xvY2so JmRvbWxpc3RfcmVhZF9sb2NrKTsKKyAgICBmb3JfZWFjaF9kb21haW5faW5fY3B1cG9vbChkLCBj KQorICAgICAgICBkb21haW5fdXBkYXRlX25vZGVfYWZmaW5pdHkoZCk7CisgICAgcmN1X3JlYWRf dW5sb2NrKCZkb21saXN0X3JlYWRfbG9jayk7CisKIG91dDoKICAgICBzcGluX3VubG9jaygmY3B1 cG9vbF9sb2NrKTsKICAgICBjcHVwb29sX2RwcmludGsoImNwdXBvb2xfdW5hc3NpZ25fY3B1IHJl dD0lbGRcbiIsIHJldCk7CkBAIC0zNzksMTIgKzM4Niw2IEBAIHN0YXRpYyBpbnQgY3B1cG9vbF91 bmFzc2lnbl9jcHUoc3RydWN0IGNwdXBvb2wgKmMsIHVuc2lnbmVkIGludCBjcHUpCiAgICAgYXRv bWljX2luYygmYy0+cmVmY250KTsKICAgICBjcHVwb29sX2NwdV9tb3ZpbmcgPSBjOwogICAgIGNw dW1hc2tfY2xlYXJfY3B1KGNwdSwgYy0+Y3B1X3ZhbGlkKTsKLQotICAgIHJjdV9yZWFkX2xvY2so JmRvbWxpc3RfcmVhZF9sb2NrKTsKLSAgICBmb3JfZWFjaF9kb21haW5faW5fY3B1cG9vbChkLCBj KQotICAgICAgICBkb21haW5fdXBkYXRlX25vZGVfYWZmaW5pdHkoZCk7Ci0gICAgcmN1X3JlYWRf dW5sb2NrKCZkb21saXN0X3JlYWRfbG9jayk7Ci0KICAgICBzcGluX3VubG9jaygmY3B1cG9vbF9s b2NrKTsKIAogICAgIHdvcmtfY3B1ID0gc21wX3Byb2Nlc3Nvcl9pZCgpOwo= --=-OrzIlTABTSsbabFyjZRm-- --=-1ifxGyOu52BM/haD41Zn Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEABECAAYFAlT9tzMACgkQk4XaBE3IOsQ5tACfYd20lxs15xNDu41MOwizPM6m J9oAnjsI4jl3fliD8xjGyCdNYh1CuuH2 =iTTn -----END PGP SIGNATURE----- --=-1ifxGyOu52BM/haD41Zn-- --===============1776275049535943026== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============1776275049535943026==--