From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes Date: Wed, 18 Jul 2012 00:15:20 +0200 Message-ID: <1342563320.11794.19.camel@Abyss> References: <0411b2cebd725b193465.1341932614@Solace> <20485.35590.105351.434937@mariner.uk.xensource.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============3449445333181444075==" Return-path: In-Reply-To: <20485.35590.105351.434937@mariner.uk.xensource.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Jackson Cc: Andre Przywara , Ian Campbell , Stefano Stabellini , George Dunlap , Juergen Gross , xen-devel List-Id: xen-devel@lists.xenproject.org --===============3449445333181444075== Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature"; boundary="=-cX89JwYvYLhY63OQACWX" --=-cX89JwYvYLhY63OQACWX Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, 2012-07-17 at 16:55 +0100, Ian Jackson wrote:=20 > Dario Faggioli writes ("[PATCH 1 of 3 v4/leftover] libxl: enable automati= c placement of guests on NUMA nodes"): > > If a domain does not have a VCPU affinity, try to pin it automatically = to some > > PCPUs. This is done taking into account the NUMA characteristics of the= host. > > In fact, we look for a combination of host's NUMA nodes with enough fre= e memory > > and number of PCPUs for the new domain, and pin it to the VCPUs of thos= e nodes. >=20 > Thanks for this admirably clear patch. >=20 Thanks to you for looking at it. > Can you please rewrap your commit messages to around 70 lines ? Many > VCSs indent them in the log in some situations, and as you see here > mail programs indent them when quoting too. >=20 That should have happened in the first place. As I'm reposting, I'll take extra attention to that, sorry. > > +/* Subtract two values and translate the result in [0, 1] */ > > +static double normalized_diff(double a, double b) > > +{ > > +#define max(a, b) (a > b ? a : b) > > + if (!a && a =3D=3D b) > > + return 0.0; > > + return (a - b) / max(a, b); > > +} >=20 > 1. This macro max() should be in libxl_internal.h. > 2. It should be MAX so people are warned it's a macro > 3. It should have all the necessary ()s for macro precedence safety >=20 Ok, will do that. > > + double freememkb_diff =3D normalized_diff(c2->free_memkb, c1->free= _memkb); > > + double nrdomains_diff =3D normalized_diff(c1->nr_domains, c2->nr_d= omains); > > + > > + if (c1->nr_nodes !=3D c2->nr_nodes) > > + return c1->nr_nodes - c2->nr_nodes; > > + > > + return sign(3*freememkb_diff + nrdomains_diff); >=20 > The reason you need what sign() does is that you need to convert from > double to int, I guess. >=20 Mostly for that and to make what's happening even more clear. > > + > > + /* > > + * Check if the domain has any CPU affinity. If not, try to build = up one. > > + * In case numa_place_domain() find at least a suitable candidate,= it will > > + * affect info->cpumap accordingly; if it does not, it just leaves= it > > + * as it is. This means (unless some weird error manifests) the su= bsequent > > + * call to libxl_set_vcpuaffinity_all() will do the actual placeme= nt, > > + * whatever that turns out to be. > > + */ > > + if (libxl_bitmap_is_full(&info->cpumap)) { > > + int rc =3D numa_place_domain(gc, info); > > + if (rc) > > + return rc; > > + } >=20 > I guess it would be preferable to do this only if the bitmap was full > by default, so that setting the bitmap explicitly to all cpus still > works. >=20 > I'm not sure that that's essential to have, though. >=20 I was thinking about this right in this days, and I think it should be exactly as you say, as one need a mechanism for disabling this thing as a whole. I really don't think it should take to much to put something together, even as a separate, future, patch, if these get checked-in. Thanks. > > + /* > > + * Conditions are met, we can add this combination to the > > + * NUMA placement candidates list. We first make sure ther= e > > + * is enough space in there, and then we initialize the ne= w > > + * candidate element with the node map corresponding to th= e > > + * combination we are dealing with. Memory allocation for > > + * expanding the array that hosts the list happens in chun= ks > > + * equal to the number of NUMA nodes in the system (to > > + * avoid allocating memory each and every time we find a > > + * new candidate). > > + */ > > + if (*nr_cndts =3D=3D array_size) > > + array_size +=3D nr_nodes; > > + GCREALLOC_ARRAY(new_cndts, array_size); >=20 > This part of the algorithm is quadratic in the number of combinations > divided by the number of nodes. So the algorithm is > O( ( C( nr_nodes, min_nodes ) / min_nodes )^2 ) > which is quite bad really. >=20 I might well be wrong, but I was thinking to it as something like this: O( C(nr_nodes,(nr_nodes/2)) * nr_nodes ) That's because the external while() is repeated, at most, nr_nodes times (if min_nodes=3D1 and max_nodes=3Dnr_nodes). Each of these steps hosts a for() which visits all the combinations, the maximum number of which ISTR to be C(nr_nodes,(nr_nodes/2)). I'm not sure what you meant when putting min_nodes up there in your formula (min_nodes is likely to be 1 most of the cases...), so I can't get numbers and compare them, but it looked (looks?) a bit less bad to me... Or did I make some obvious mistake I'm not seeing right now? :-( > At the very least this needs to be an exponential allocation, eg > + array_size +=3D nr_nodes + array_size; >=20 > But if you didn't insist on creating the whole list and sorting it, > you would avoid this allocation entirely, wouldn't you ? >=20 I'll kill the separation between candidate identification and sorting: that's easy, quick, and will make George happy as well. :-) > Should we bve worried that this algorithm will be too slow even if it > involves just > O( C(nr_nodes,min_nodes) ) > iterations ? >=20 I'm commenting about this in the other thread you opened... Thanks and Regards, Dario --=20 <> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) --=-cX89JwYvYLhY63OQACWX Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEABECAAYFAlAF4/gACgkQk4XaBE3IOsSU0gCdEdtmFxKW4Rh8/QNTj+bSix51 X6wAn2rS8dgpt2ZIwP3eQx8bqhbeEfGE =kh/F -----END PGP SIGNATURE----- --=-cX89JwYvYLhY63OQACWX-- --===============3449445333181444075== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============3449445333181444075==--