xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
From: Dario Faggioli <raistlin@linux.it>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
Date: Thu, 21 Jun 2012 18:34:12 +0200	[thread overview]
Message-ID: <1340296452.4856.87.camel@Solace> (raw)
In-Reply-To: <1340278832.21872.112.camel@zakaz.uk.xensource.com>


[-- Attachment #1.1: Type: text/plain, Size: 14938 bytes --]

On Thu, 2012-06-21 at 12:40 +0100, Ian Campbell wrote:
> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> > --- a/docs/man/xl.cfg.pod.5
> > +++ b/docs/man/xl.cfg.pod.5
> > @@ -111,8 +111,8 @@ created online and the remainder will be
> > 
> >  =item B<cpus="CPU-LIST">
> > 
> > -List of which cpus the guest is allowed to use. Default behavior is
> 
> There is (according to grep) a slight preference for British spelling,
> but I imagine we are totally inconsistent all over the place so lets not
> worry too much ;-)
> 
Especially considering I'm removing it! :-P (However, I'm sure it was me
that put that line in some time ago, so I'll make a note about
this! :-D).

> > @@ -132,6 +132,12 @@ run on cpu #3 of the host.
> > 
> >  =back
> > 
> > +If this option is not specified, libxl automatically tries to place the new
> > +domain on the host's NUMA nodes (provided the host has more than one NUMA
> > +node) by pinning it to the cpus of those nodes. An heuristical approach is
> 
> I don't think heuristical is a word, I think "A heuristic approach..."
> is fine
> 
Fixed (with "A heuristic approach"). Thanks.

> > + * the same instance of the iterator and the same values for
> > + * n and k are used for each call. If that doesn't happen, the
> > + * result is unspecified.
> > + *
> > + * The algorithm is a well known one,
> 
> Worth a link/reference?
> 
As you like... The general idea is described, for example, in this
Wikipedia page http://en.wikipedia.org/wiki/Combination , the actual
algorithm appeared, for instance, in D. Knuth's "The Art of Computer
Programming - Volume 4, Fascicle 3", but in many other places as well. I
guess I can cite TAOCP, although I'm not sure it is actually the first,
and thus the one who should be credited for this... But I guess that
doesn't matter too much, does it?

> > + *
> > + * For example, with n = 5 and k = 3, calling comb_init()
> > + * will generate { 0, 1, 2 }, while subsequent valid calls to
> > + * comb_next() will produce the following:
> > + * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 },
> 
>                 ^ strictly speaking this 0,1,2 isn't produced by a call
>                 to comb_next since it came from the initial comb_init,
>                 right? 
Yes, that's right.

> IOW the first comb_next() call won't duplicate
>                 it...
> 
No, it won't. I'll fix this in the comment.

> I'm not going to try very hard to review the algorithm here, I trust you
> and I've got a head cold ;-)
> 
Fine. :-)

> > +/* NUMA automatic placement (see libxl_internal.h for details) */
> > +
> > +/*
> > + * This function turns a k-combination iterator into a node map.
> > + * This means the bits in the node map corresponding to the indexes
> > + * of the given combination are the ones that will be set.
> > + * For example, if the iterator represents the combination { 0, 2, 4},
> > + * the node map will have bits #0, #2 and #4 set to one and i thus
> > + * will point to node 0, node 2 and and node 4.
> 
> What is "i" in this last bit ("i thus will point to")? I don't think you
> are referring to the loop iterator variable.
> 
> If you meant "it" then I don't think it needs saying that a node map
> with bits 0, 2 and 4 set refers to nodes 0, 2 and 4.
> 
It was actually "it", and I'm happy to kill that part of the sentence.

> > +/* Retrieve the number of cpus that the nodes that are part of the nodemap
> > + * span. */
> > +static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus,
> > +                                 const libxl_bitmap *nodemap)
> 
> nodes here == node's ? But can't have an apostrophe in a function which
> makes it ready weird to me. Perhaps "nodemap_count_cpus"?
> 
Right. I wanted to stick to the nodemap_to_xxx() for those group of
function, so I changed this one into "nodemap_to_nr_cpus()". Hope it is
fine...

> > +/* Retrieve the number of domains that can potentially run on the cpus
> > + * the nodes that are part of the nodemap. */
> > +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo,
> > +                                 const libxl_bitmap *nodemap)
> > +{
> > +    libxl_dominfo *dinfo = NULL;
> > +    libxl_bitmap dom_nodemap;
> > +    int nr_doms, nr_cpus;
> > +    int nr_domains = 0;
> > +    int i, j, k;
> > +
> > +    dinfo = libxl_list_domain(CTX, &nr_doms);
> > +    if (dinfo == NULL)
> > +        return ERROR_FAIL;
> > +
> > +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) {
> > +        nr_domains = ERROR_FAIL;
> > +        goto out;
> > +    }
> > +
> > +    for (i = 0; i < nr_doms; i++) {
> > +        libxl_vcpuinfo *vinfo;
> > +        int nr_vcpus;
> > +
> > +        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus);
> > +        if (vinfo == NULL)
> > +            continue;
> > +
> > +        libxl_bitmap_set_none(&dom_nodemap);
> > +        for (j = 0; j < nr_vcpus; j++) {
> > +            libxl_for_each_set_bit(k, vinfo[j].cpumap)
> > +                libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
> > +        }
> > +
> > +        libxl_for_each_set_bit(j, dom_nodemap) {
> > +            if (libxl_bitmap_test(nodemap, j)) {
> > +                nr_domains++;
> > +                goto found;
> 
> AKA break?
> 
Yes. Leftover from v1, which I forgot to adapt.
> > +            }
> > +        }
> > + found:
> > +        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
> > +    }
> > +
> > + out:
> > +    libxl_bitmap_dispose(&dom_nodemap);
> 
> You can come to out if libxl_nodebitmap_alloc fails -- in which case
> dom_nodemap is (potentially) uninitialised.
>
> You could just move out: down a line.
> 
Right. As I did not really like that "nr_domains = ERROR_FAIL" I changed
the error handling style of this function as a whole so that now I do
not need the out: label any more.

> > +/*
> > + * This function tries to figure out if the host has a consistent number
> > + * of cpus along all its NUMA nodes. In fact, if that is the case, we can
> > + * calculate the minimum number of nodes needed for a domain by just
> > + * dividing its total number of vcpus by this value computed here.
> > + * However, we are not allowed to assume that all the nodes have the
> > + * same number of cpus. Therefore, in case discrepancies among different
> > + * nodes are found, this function just returns 0 and the caller needs
> > + * to sort out things in some other way.
> 
> If the caller has to deal with this anyway why bother with this check
> and/or the other algorithm?
> 
Mmm... I'm not sure I get what you mean here. Perhaps my commenting
about the whole thing was not so clear in the first place.

The (George's :-)) idea is, if we know each node has 8 CPUs, and our
domain wants more than that, it does not make sense to start looking for
placement solutions with just one node. Actually, if the domain wants
fewer than 16 CPUs, starting looking from solutions with 2 nodes in them
should be the right thing.

That being said, I think it's important we do not assume we won't ever
find some architecture with different number of CPUs in different nodes,
and that is what a zero return from that function means.

What was that you thought not to be worthwhile?

> > +/* Get all the placement candidates satisfying some specific conditions */
> > +int libxl__get_numa_candidates(libxl__gc *gc,
> > +                               uint32_t min_free_memkb, int min_cpus,
> > +                               int min_nodes, int max_nodes,
> > +                               libxl__numa_candidate *cndts[], int *nr_cndts)
> > +{
> > +    libxl__numa_candidate *new_cndts = NULL;
> > +    libxl_cputopology *tinfo = NULL;
> > +    libxl_numainfo *ninfo = NULL;
> > +    libxl_bitmap nodemap;
> > +    int nr_nodes, nr_cpus;
> > +    int array_size, rc;
> > +
> > +    /* Get platform info and prepare the map for testing the combinations */
> > +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> > +    if (ninfo == NULL)
> > +        return ERROR_FAIL;
> > +    /* If we don't have at least 2 nodes, it is useless to proceed */
> 
> You don't want to return whichever of those 2 nodes meets the
> constraints?
> 
I actually was meaning something like "hey, there's only _1_ node, go
for it!" :-P

> ...
> > +    *cndts = new_cndts;
> > + out:
> > +    libxl_bitmap_dispose(&nodemap);
> 
> nodemap might not have been initialised?
>
Mmm... Right. So I really think I need an init function. :-(

> > +    libxl_cputopology_list_free(tinfo, nr_cpus);
> > +    libxl_numainfo_list_free(ninfo, nr_nodes);
> 
> It looks like the nr_* may also not have been initialised, but maybe
> list_free is safe in that case since the associated array must
> necessarily be NULL?
> 
Well, probably, but as they both do a "for (i=0;i<nr_*;i++)" I think
it's safer if I "=0" those twos, is that ok?

> > +/*
> > + * The NUMA placement candidates are reordered according to the following
> > + * heuristics:
> > + *  - candidates involving fewer nodes come first. In case two (or
> > + *    more) candidates span the same number of nodes,
> > + *  - candidates with greater amount of free memory come first. In
> > + *    case two (or more) candidates differ in their amount of free
> > + *    memory by less than 10%,
> > + *  - candidates with fewer domains insisting on them at the time of
> > + *    this call come first.
> > + */
> > +static int numa_cmpf(const void *v1, const void *v2)
> > +{
> > +    const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1;
> > +    const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;
> 
> Are these casts necessary?
> 
Will check and remove if not. Thanks.

> > +    double mem_diff = labs(c1->free_memkb - c2->free_memkb);
> > +    double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
> > +
> > +    if (c1->nr_nodes != c2->nr_nodes)
> > +        return c1->nr_nodes - c2->nr_nodes;
> > +
> > +    if ((mem_diff / mem_avg) * 100.0 < 10.0 &&
> 
> Is all this FP math necessary? Using integers might have some rounding
> but is it significant or do we care if it gets it slightly wrong?
> 
I think integer is fine... It's heuristics, after all. I'll go for it.

> > +/* The actual automatic NUMA placement routine */
> > +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> > +{
> > +    int nr_candidates = 0;
> > +    libxl__numa_candidate *candidates = NULL;
> > +    libxl_bitmap candidate_nodemap;
> > +    libxl_cpupoolinfo *pinfo;
> > +    int nr_pools, rc = 0;
> > +    uint32_t memkb;
> > +
> > +    /* First of all, if cpupools are in use, better not to mess with them */
> > +    pinfo = libxl_list_cpupool(CTX, &nr_pools);
> > +    if (!pinfo)
> > +        return ERROR_FAIL;
> > +    if (nr_pools > 1) {
> > +        LOG(NOTICE, "skipping NUMA placement as cpupools are in use");
> > +        goto out;
> > +    }
> > +
> > +    rc = libxl_domain_need_memory(CTX, info, &memkb);
> > +    if (rc)
> > +        goto out;
> > +    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
> > +        rc = ERROR_FAIL;
> > +        goto out;
> > +    }
> > +
> > +    /* Find all the candidates with enough free memory and at least
> > +     * as much pcpus as the domain has vcpus.  */
> > +    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
> > +                                    &candidates, &nr_candidates);
> > +    if (rc)
> > +        goto out;
> > +
> > +    LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates);
> > +
> > +    /* No suitable placement candidates. We just return without touching the
> > +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> > +    if (nr_candidates == 0)
> > +        goto out;
> > +
> > +    /* Bring the best candidate in front of the list --> candidates[0] */
> > +    if (nr_candidates > 1)
> > +        libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);
> 
> Is the start up cost of libxl__sort_numa_candidates significant enough
> to make that if worthwhile?
> 
I'm not sure what you mean here, I'm afraid ... :-(

> > +
> > +    /*
> > +     * Check if the domain has any CPU affinity. If not, try to build up one.
> > +     * In case numa_place_domain() find at least a suitable candidate, it will
> > +     * affect info->cpumap accordingly; if it does not, it just leaves it
> > +     * as it is. This means (unless some weird error manifests) the subsequent
> > +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> > +     * whatever that turns out to be.
> > +     */
> > +    if (libxl_bitmap_is_full(&info->cpumap)) {
> > +        int rc = numa_place_domain(gc, info);
> > +        if (rc)
> > +            return rc;
> 
> I'm not sure about this, if numa_place_domain fails for any reason would
> we be not better off logging and continuing without placement? We
> already do that explicitly in a couple of cases.
> 
Well, actually, if it "fails" in the sense it finds no candidates or
does not manage in placing the domain, it returns 0, so we do right what
you're suggesting. I'm sure this is stated in some comment but I guess I
can repeat it here. If !rc, it means we stepped into some ERROR_FAIL or
something, which I think has to be handled like this, hasn't it?

Perhaps I can also add some logging about "successfully failing", i.e.,
not getting into any error but not being able to place the domain. If
yes, that will happen _inside_ numa_place_domain() rather than here.

> > + * dynamically computed. A candidate can consist of just one, up to all the
> > + * available nodes on the host (the latter being, of course, not ideal).
> 
> I can't parse this sentence.
> 
Sorry. I just meant a placement candidate is a set of nodes, so it can
be one node, four nodes, or even comprise all the available nodes.

> > +/* signature for the comparison function between two candidates c1 and c2
> > + * (the thid parameter is provided to enable thread safety). */
> 
>            third
> 
> But there isn't actually a third parameter so it's a bit moot ;-)
> 
> > +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
>
You're right. That is a leftover from where I was trying to use
qsort_r() for sorting, but it turned out to not be necessary in this
case. Killed.

Thanks again and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

  reply	other threads:[~2012-06-21 16:34 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-15 17:04 [PATCH 00 of 10 v2] Automatic NUMA placement for xl Dario Faggioli
2012-06-15 17:04 ` [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro Dario Faggioli
2012-06-21  8:53   ` Ian Campbell
2012-06-26 16:00     ` Ian Campbell
2012-06-26 16:26       ` Dario Faggioli
2012-06-15 17:04 ` [PATCH 02 of 10 v2] libxl: add a new Array type to the IDL Dario Faggioli
2012-06-15 17:04 ` [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo() Dario Faggioli
2012-06-21  9:02   ` Ian Campbell
2012-06-21 10:00     ` Dario Faggioli
2012-06-21 10:21       ` Ian Campbell
2012-06-15 17:04 ` [PATCH 04 of 10 v2] xl: add more NUMA information to `xl info -n' Dario Faggioli
2012-06-21  9:04   ` Ian Campbell
2012-06-15 17:04 ` [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap Dario Faggioli
2012-06-21  9:12   ` Ian Campbell
2012-06-21  9:49     ` Dario Faggioli
2012-06-21 10:22       ` Ian Campbell
2012-06-15 17:04 ` [PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit Dario Faggioli
2012-06-21  9:30   ` Ian Campbell
2012-06-21  9:46     ` Dario Faggioli
2012-06-15 17:04 ` [PATCH 07 of 10 v2] libxl: introduce some node map helpers Dario Faggioli
2012-06-21  9:35   ` Ian Campbell
2012-06-21  9:44     ` Dario Faggioli
2012-06-15 17:04 ` [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes Dario Faggioli
2012-06-21 11:40   ` Ian Campbell
2012-06-21 16:34     ` Dario Faggioli [this message]
2012-06-22 10:14       ` Ian Campbell
2012-06-26 16:25         ` Dario Faggioli
2012-06-26 16:26           ` Ian Campbell
2012-06-26 17:23             ` Ian Jackson
2012-06-21 16:16   ` George Dunlap
2012-06-21 16:43     ` Dario Faggioli
2012-06-22 10:05       ` George Dunlap
2012-06-26 11:03         ` Ian Jackson
2012-06-26 15:20           ` Dario Faggioli
2012-06-27  8:15           ` Dario Faggioli
2012-06-28  7:25   ` Zhang, Yang Z
2012-06-28  8:36     ` George Dunlap
2012-06-29  5:38       ` Zhang, Yang Z
2012-06-29  9:46         ` Dario Faggioli
2012-06-28 10:12     ` Dario Faggioli
2012-06-28 12:41       ` Pasi Kärkkäinen
2012-06-28 17:03         ` Dario Faggioli
2012-06-29  5:29           ` Zhang, Yang Z
2012-06-29  9:38             ` Dario Faggioli
2012-06-15 17:04 ` [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools Dario Faggioli
2012-06-21 13:31   ` Ian Campbell
2012-06-21 13:54     ` Dario Faggioli
2012-06-21 13:58       ` Ian Campbell
2012-06-15 17:04 ` [PATCH 10 of 10 v2] Some automatic NUMA placement documentation Dario Faggioli
2012-06-18 15:54   ` Dario Faggioli
2012-06-21 13:38   ` Ian Campbell
2012-06-21 13:57     ` Dario Faggioli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1340296452.4856.87.camel@Solace \
    --to=raistlin@linux.it \
    --cc=George.Dunlap@eu.citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=Stefano.Stabellini@eu.citrix.com \
    --cc=andre.przywara@amd.com \
    --cc=juergen.gross@ts.fujitsu.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).