From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: Re: [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file Date: Fri, 13 Apr 2012 00:21:37 +0200 Message-ID: <1334269297.2417.23.camel@Abyss> References: <7e76233448b02810f0ae.1334150272@Solace> <4F86AD6E.3050705@eu.citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============5653723866298172454==" Return-path: In-Reply-To: <4F86AD6E.3050705@eu.citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: George Dunlap Cc: Andre Przywara , Ian Campbell , Stefano Stabellini , Juergen Gross , Ian Jackson , "xen-devel@lists.xen.org" , Jan Beulich List-Id: xen-devel@lists.xenproject.org --===============5653723866298172454== Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature"; boundary="=-LZ8ELMCpHrrBBTavRHqM" --=-LZ8ELMCpHrrBBTavRHqM Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, 2012-04-12 at 11:24 +0100, George Dunlap wrote:=20 > > A valid setting will directly impact the node_affinity mask > > of the guest, i.e., it will change the behaviour of the low level > > memory allocator. On the other hand, this commit does not affect > > by any means how the guest's vCPUs are scheduled on the host's > > pCPUs. > I would probably break this into three separate patches, starting with= =20 > the hypervisor, then libxc, then libxl, especially since they tend to=20 > have different maintainers and committers. > That's fine. > > +A list of nodes should be specified as follows: `nodes=3D["0", "3"]` > > +for the guest to be affine with nodes 0 and 3. > > + > Hmm, I think that using "affine" here is technically correct, and is=20 > what one would use if writing a research paper; but it's unusual to hear= =20 > the word in more common English; it would be more common to hear someone= =20 > describe a VM as "having affinity with". > I see... > How about something like this: >=20 > "The list of NUMA nodes the domain is considered to have affinity with. = =20 > Memory from the guest will be allocated from these nodes." >=20 Sounds good, I'll go for it. Thanks for looking at these stuff too, I really need and enjoy some good English training! :-P > Also, is there a way to specify that the affinity to be to all nodes=20 > and/or based on the vcpu mask of the vcpus? > Well, yes but no. :-) Actually, if you do not specify any affinity, the default is right now 'all nodes'. As we might want to change that, I'll add something like `nodes=3D"all"`. Similarly, if you don't specify any "nodes=3D", and you have some vcpu-affinity, this patch won't call any node_affinity setting routine, so that the default behavior of building it up upon vcpu-affinity will be retained. Actually, even with the following patches that introduces automatic placement, if no node-affinity is specified, while a vcpu-affinity is, the algorithm will try to place the domain where vcpu-affinity is saying. So, summarizing, I'll add some means of saying "all nodes", while I don't think anything is needed to say "do what vcpu-affinity wants", but I'm more than open to suggestions and alternatives if you think this is not good/clear enough. > > +/** > > + * This function explicitly sets the affinity of a domain with the > > + * host NUMA nodes. > > + * > > + * @parm xch a handle to an open hypervisor interface. > > + * @parm domid the domain id in which vcpus are to be created. > > + * @parm node_affinity the map of the affine nodes. > > + * @return 0 on success, -1 on failure. > > + */ > > +int xc_domain_node_affinity(xc_interface *xch, > > + uint32_t domind, > > + xc_nodemap_t node_affinity); > > + > Seems like it would be useful to have both a get and a set. > Ok. Not right now, but I agree it would make it a more sane interface. Will add a *_get_affinity (and change this to *_set_affinity) > > int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap= *cpumap) > > { > > GC_INIT(ctx); > > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h > > --- a/tools/libxl/libxl.h > > +++ b/tools/libxl/libxl.h > > @@ -727,6 +727,8 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct > > libxl_cpumap *cpumap); > > int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, > > unsigned int max_vcpus, libxl_cpumap *= cpumap); > > +int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid, > > + libxl_nodemap *nodemap); > Hmm -- is there really no libxl_get_vcpuaffinity()? That seems to be a= =20 > missing component... > Apparently not, seems like `xl vcpu-list' reaches out print_vcpuinfo() (xl.c:4268 here) which figures out thing by its own with something like this: /* CPU AFFINITY */ print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout); Should we change this? > > + } > > + else { > > + /* > > + * XXX What would a sane default be? I think doing nothing > > + * (i.e., relying on cpu-affinity/cpupool =3D=3D> the cur= rent > > + * behavior) should be just fine. > > + * That would mean we're saying to the user "if you want u= s > > + * to take care of NUMA, please tell us, maybe just puttin= g > > + * 'nodes=3Dauto', but tell us... otherwise we do as usual= ". > > + */ > I think that for this patch, doing nothing is fine (which means removing= =20 > the whole else clause). But once we have the auto placement, I think=20 > that "nodes=3Dauto" should be the default. > Ok, that makes sense. I'm running benchmarks comparing the three different policies I've implemented so far (see patches 8 and 9), to understand if they make any sense at all and, if they do, which one would be better suited for being used as the default policy. Will post the results as soon as they finish. > > + case XEN_DOMCTL_node_affinity: > > + { > > + domid_t dom =3D op->domain; > > + struct domain *d =3D rcu_lock_domain_by_id(dom); > > + nodemask_t new_affinity; > > + > > + ret =3D -ESRCH; > > + if ( d =3D=3D NULL ) > > + break; > > + > > + /* XXX We need an xsm_* for this I guess, right? */ > Yes. :-) > Ok, I'll look into it and put it together for the next release of this series. I'm planning to take great inspiration from the one in *_vcpu_affinity, do you think that could be the right thing? > > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > > --- a/xen/include/xen/sched.h > > +++ b/xen/include/xen/sched.h > > @@ -346,8 +346,12 @@ struct domain > > /* Various mem_events */ > > struct mem_event_per_domain *mem_event; > > > > - /* Currently computed from union of all vcpu cpu-affinity masks. *= / > > + /* > > + * Can be specified by the user. If that is not the case, it is > > + * computed from the union of all the vcpu cpu-affinity masks. > > + */ > > nodemask_t node_affinity; > > + int has_node_affinity; > There's something that seems a bit clunky about this; but I can't really= =20 > think of a better way. At every least I'd change the name of the=20 > element here to something more descriptive; perhaps "auto_node_affinity"= =20 > (which would invert the meaning)? > You know what, I really don't like this either, but I need something that tells me whether I should compute the node affinity from vcpu-affinity/cpupool or the user has set that to something I shouldn't touch, and did not find any other way of doing such than adding this "flag". I kind of took inspiration from d->is_pinned although, yes, I know, that is used for different purposes. Any suggestion is welcome, in the meanwhile, I'll change the name and the semantic to what you're suggesting. Thanks and Regards, Dario --=20 <> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) --=-LZ8ELMCpHrrBBTavRHqM Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iEYEABECAAYFAk+HVXEACgkQk4XaBE3IOsQzegCfbYTISuIT137BeUoVJTPy6uh8 aksAoKVWPyk0XPiY+3Cimw3u3/invBQc =nwxO -----END PGP SIGNATURE----- --=-LZ8ELMCpHrrBBTavRHqM-- --===============5653723866298172454== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============5653723866298172454==--