Message-ID: <1442997058.24964.20.camel@citrix.com>
Subject: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
From: Dario Faggioli
To: Juergen Gross, George Dunlap
CC: "xen-devel@lists.xenproject.org", "Andrew Cooper", "Luis R. Rodriguez", linux-kernel, "David Vrabel", Boris Ostrovsky, Stefano Stabellini
Date: Wed, 23 Sep 2015 10:30:58 +0200
In-Reply-To: <56022C57.3000409@suse.com>
References: <1439913332.4239.134.camel@citrix.com> <55D61964.90608@suse.com> <1442335855.7789.45.camel@citrix.com> <55FF9A50.9040505@suse.com> <5600DC4B.1000509@suse.com> <56018050.1010009@citrix.com> <56022C57.3000409@suse.com>
Organization: Citrix Inc.
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=-Zm/m9SvYUW3nlmXYdoMr"
X-Mailer: Evolution 3.16.5 (3.16.5-1.fc22)
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

--=-Zm/m9SvYUW3nlmXYdoMr
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, 2015-09-23 at 06:36 +0200, Juergen Gross wrote:
> On 09/22/2015 06:22 PM, George Dunlap wrote:
> > Juergen / Dario, could one of you summarize your two approaches, and the
> > (alleged) advantages and disadvantages of each one?
>
> Okay, I'll have a try:
>
Thanks for this!
;-)

> The problem we want to solve:
> -----------------------------
>
> The Linux kernel is gathering cpu topology data during boot via the
> CPUID instruction on each processor coming online. This data is
> primarily used in the scheduler to decide to which cpu a thread should
> be migrated when this seems to be necessary. There are other users of
> the topology information in the kernel (e.g. some drivers try to do
> optimizations like core-specific queues/lists).
>
> When started in a virtualized environment the obtained data is next to
> useless or even wrong, as it is reflecting only the status of the time
> of booting the system. Scheduling of the (v)cpus done by the hypervisor
> is changing the topology beneath the feet of the Linux kernel without
> reflecting this in the gathered topology information. So any decisions
> taken based on that data will be clueless and possibly just wrong.
>
Exactly.

> The minimal solution is to change the topology data in the kernel in a
> way that all cpus are regarded as equal regarding their relation to each
> other (e.g. when migrating a thread to another cpu no cpu is preferred
> as a target).
>
> The topology information of the CPUID instruction is, however, even
> accessible from user mode and might be used for licensing purposes of
> any user program (e.g. by limiting the software to run on a specific
> number of cores or sockets). So just mangling the data returned by
> CPUID in the hypervisor seems not to be a general solution, while we
> might want to do it at least optionally in the future.
>
Yep. It turned out that, although it is what started all this, CPUID
handling is a somewhat related but mostly independent problem. :-)

> In the future we might want to support either dynamic topology updates
> or be able to tell the kernel to use some of the topology data, e.g.
> when pinning vcpus.
>
Indeed. At least for the latter.
Dynamic looks really difficult to me, but indeed it would be ideal.
Let's see.

> Solution 1 (Dario):
> -------------------
>
> Don't use the CPUID derived topology information in the Linux scheduler,
> but let it use a simple "flat" topology by setting own scheduler domain
> data under Xen.
>
> Advantages:
> + very clean solution regarding the scheduler interface
>
Yes, this is, I think, one of the main advantages of the patch. The
scheduler offers an interface for architectures to define their
topology requirements, and I'm using it to specify ours: the tool for
the job. :-D

> + scheduler decisions are based on a minimal data set
> + small patch
>
> Disadvantages:
> - covers the scheduler only, drivers still use the "wrong" data
>
This is a good point. It was the patch's purpose, TBH, but it's
certainly true that, if we need something similar elsewhere, we need to
do more.

> - a little bit hacky regarding some NUMA architectures (needs either a
>   hook in the code dealing with that architecture or multiple scheduler
>   domain data overwrites)
>
As I said in my other email, I'll double check (yes, I also think this
is about AMD boxes with intra-socket NUMA nodes).

> - future enhancements will make the solution less clean (either need
>   duplicating scheduler domain data or some new hooks in scheduler
>   domain interface)
>
This one, I'm not sure I understand.

> Solution 2 (Juergen):
> ---------------------
>
> When booted as a Xen guest modify the topology data built during boot
> resulting in the same simple "flat" topology as in Dario's solution.
>
> Advantages:
> + the simple topology is seen by all consumers of topology data as the
>   data itself is modified accordingly
>
Yep, that's a good point.

> + small patch
> + future enhancements rather easy by selecting which data to modify
>
As for the '-' above about this, I'm not really sure what this means.
>
> Disadvantages:
> - interface to scheduler not as clean as in Dario's approach
> - scheduler decisions are based on multiple layers of topology data
>   where one layer would be enough to describe the topology
>
This is not too big of a deal, IMO. Not at runtime, at least, as far as
my investigation has gone so far. Initialization (of scheduling domains)
is a bit clumsy in this case, as scheduling domains are created and then
destroyed/collapsed, but after they are set up, the net effect is that
there's only one scheduling domain with Juergen's patch too, exactly as
with mine.

> Dario, are you okay with this summary?
>
To most of it, yes, and thanks again for it.

Allow me to add a few points, off the top of my head:

 * we need to check whether the two approaches have the same
   performance. In principle, they really should, and early results
   seem to confirm that, but I'd like to run the full set of benches
   (and I'll do that ASAP);
 * I think we want to run even more benchmarks, and run them in
   different (over)load conditions, to better assess the effect of the
   change;
 * both our patches provide a solution for Xen (for Xen PV guests, at
   least for now, to be more precise). It is very likely that, e.g.,
   KVM is in a similar situation, hence it may be worth looking for a
   more general solution, especially if that buys us something (e.g.,
   HVM support made easy?)

Thanks and Regards,
Dario

PS. BTW, Juergen, you're not on IRC, on #xendevel, are you?
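
For readers who want to see the data this thread is about, here is a
minimal Python sketch. It reads the per-CPU topology a running Linux
system exposes under /sys/devices/system/cpu/cpuN/topology/ and
contrasts a hierarchical grouping (by core, then by package) with a
"flat" view, modelled here as one group in which all CPUs are equal.
It is not code from either patch and does not touch the scheduler. The
function names (read_topology, group_levels, flat_level) are invented
for this illustration, and it assumes only that the sysfs files
physical_package_id and core_id are present.

```python
from pathlib import Path


def read_topology(root="/sys/devices/system/cpu"):
    """Return {cpu: (physical_package_id, core_id)} as exposed by sysfs."""
    base = Path(root)
    topo = {}
    if not base.is_dir():
        return topo  # not Linux, or sysfs not mounted
    for cpu_dir in base.glob("cpu[0-9]*"):
        tdir = cpu_dir / "topology"
        try:
            cpu = int(cpu_dir.name[3:])
            pkg = int((tdir / "physical_package_id").read_text())
            core = int((tdir / "core_id").read_text())
        except (OSError, ValueError):
            continue  # e.g. an offline CPU without a topology directory
        topo[cpu] = (pkg, core)
    return topo


def group_levels(topo):
    """Hierarchical view: group CPUs by core, then by package."""
    by_core, by_pkg = {}, {}
    for cpu, (pkg, core) in sorted(topo.items()):
        by_core.setdefault((pkg, core), []).append(cpu)
        by_pkg.setdefault(pkg, []).append(cpu)
    return {"core": sorted(by_core.values()),
            "package": sorted(by_pkg.values())}


def flat_level(topo):
    """Flat view: a single group in which all CPUs are equal."""
    return {"all": [sorted(topo)]}


if __name__ == "__main__":
    topo = read_topology()
    print("hierarchical:", group_levels(topo))
    print("flat:        ", flat_level(topo))
```

Inside a guest, the hierarchical output reflects whatever the kernel
recorded at boot, which, as the summary above explains, need not match
where the hypervisor actually runs the vcpus.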
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

--=-Zm/m9SvYUW3nlmXYdoMr
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iEYEABECAAYFAlYCY0IACgkQk4XaBE3IOsTU5gCgl7zkw9bS5vYaZpS+wktXWbQn
PLAAn3m9r4a6rnTEXZBBjCrar+48AkkR
=LTH0
-----END PGP SIGNATURE-----

--=-Zm/m9SvYUW3nlmXYdoMr--