Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
Date: Tue, 1 May 2012 16:45:35 +0100	[thread overview]
Message-ID: <4FA0051F.6020909@eu.citrix.com> (raw)
In-Reply-To: <31163f014d8a2da9375f.1334150275@Solace>

Hey Dario,

Thanks for the work on this -- this is obviously a very complicated area 
to try to make sense of.  Of course, that makes it even more complicated 
to review -- sorry this has taken so long.  (comments inline)

On 11/04/12 14:17, Dario Faggioli wrote:
> +
> +=item B<auto>
> +
> +use a not better specified (xl-implementation dependant) algorithm
> +to try automatically fitting the guest on the host's NUMA nodes
Uum, you mean, "Use the default placement algorithm"? :-)  We should 
specify which one this will be here.
> +
> +/* Automatic placement policies symbols, with the following meaning:
> + *  NONE   : no automatic placement at all;
> + *  STATIC : explicit nodes specification.
> + *  FFIT   : use the First Fit algorithm for automatic placement;
> + *  AUTO   : an not better specified automatic placement, likely
> + *           to be implemented as a short circuit to (one of)
> + *           the above(s);
> + */
> +#define NODES_POLICY_NONE    0
> +#define NODES_POLICY_STATIC  1
> +#define NODES_POLICY_FFIT    2
> +#define NODES_POLICY_AUTO    3
This is minor, but it seems like "auto" should be 1, so if we add 
another policy, the policies are all together, without having to move 
AUTO around every time.
> +
> +/* Store the policy for the domain while parsing */
> +static int nodes_policy = NODES_POLICY_DEFAULT;
> +
> +/* Store the number of nodes to be used while parsing */
> +static int num_nodes_policy = 0;
Why are "nodes_policy" and "num_nodes_policy" not passed in along with 
b_info?
> +static int nodes_policy_parse(const char *pol, long int *policy)
> +{
> +    int i;
> +    const char *n;
> +
> +    for (i = 0; i<  sizeof(nodes_policy_names) / sizeof(nodes_policy_names[0]); i++) {
I personally prefer an explicit NODES_POLICY_MAX, but I'll let the libxl 
maintainers decide on that.
> +
> +        /* Determine how much free memory we want on each of the nodes
> +         * that will end up in suitable_nodes. Either adding or ignoring
> +         * the modulus of the integer division should be fine (the total
> +         * number of nodes should be in the order of tens of them), so
> +         * let's add it as it should be more safe.
> +         */
> +        memb_node = (memb / (*nodes)) + (memb % (*nodes));
Wouldn't it just be easier to do a "round-up" function here, like this:
  memb_node = ( (memb + *nodes -1) / (*nodes) )

Also, is it really necessary for a VM to have an equal amount of memory 
on every node? It seems like it would be better to have 75% on one node 
and 25% on a second node than to have 25% on four nodes, for example.  
Furthermore, insisting on an even amount fragments the node memory 
further -- i.e., if we chose to have 25% on four nodes instead of 75% on 
one and 25% on another, that will make it harder for another VM to fit 
on a single node as well.

The need for NODES_POLICY_RETRY_DEC is an artifact of the above; if 
nodes were allowed to be assymetric, you wouldn't need to *decrease* the 
number of nodes to find *more* memory.
> +        /* Apply the asked retry policy for the next round. Notice
> +         * that it would be pointless to increase nodes without also
> +         * adding some nodes to the map! */
> +        *nodes += retry;
> +        if (retry == NODES_POLICY_RETRY_INC)
> +            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);
Hmm -- if I'm reading this right, the only time the nodemap won't be all 
nodes is if (1) the user specified nodes, or (2) there's a cpumask in 
effect.  If we're going to override that setting, wouldn't it make sense 
to just expand to all numa nodes?

Hmm -- though I suppose what you'd really want to try is adding each 
node in turn, rather than one at a time (i.e., if the cpus are pinned to 
nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before 
trying [1,2,3,4].  But that's starting to get really complicated -- I 
wonder if it's better to just fail and let the user change the pinnings 
/ configuration node mapping.
> +
> +    /* If a nodemap has been specified, just comply with that.
> +     * OTOH, if there's no nodemap, rely on cpumap (if any), and
> +     * fallback to all node if even that is empty. Also consider
> +     * all the nodes to be valid if only the number (i.e., _how_
> +     * _many_ of them instead of _which_ of them) of nodes the
> +     * domain wants is provided.
I feel like this comment could be made more clear...
> +     *
> +     * Concering retries, if a specific set of nodes is specified,
> +     * just try that and give up if we fail. If, instead, a set of
> +     * CPUs onto which to pin the domain is specified, try first
> +     * the exact nodes those CPUs belongs to, then also try both
> +     * a smaller and a bigger number of nodes. The same happens if
> +     * we just have the number of nodes to be used. Finally, if
> +     * there is no node-affinity, no cpu-pinning, no number of nodes,
> +     * start trying with one node and increase at the configured
> +     * rate (considering all the nodes to be suitable).
This should be above the "do {} while()" loop below.  (Also, see comment 
on increasing node map above.)
> +
> +        if (use_cpus>= b_info->max_vcpus) {
> +            rc = 0;
> +            break;
> +        }
Hmm -- there's got to be a better way to find out the minimum number of 
nodes to house a given number of vcpus than just starting at 1 and 
re-trying until we have enough.
> +        /* Add one more node and retry fitting the domain */
> +        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
Same comment as above.
>
> +    ret = place_domain(&d_config.b_info);
> +    if (ret == ESRCH || ret == ENOSPC) {
> +        fprintf(stderr, "failed to place the domain, fallback to all nodes\n");
> +        libxl_nodemap_set_any(&d_config.b_info.nodemap);
Hmm -- is this the right fallback?  I think in general, if the user asks 
for X to be done, and X cannot be done for whatever reason, the right 
thing is to tell the user that X cannot be done and let them sort it 
out, rather than resorting to Y (unless that's been set).

Is it the case that if any node placement fails, then they'll all fail?  
Or to put it the other way, is it the case that if there is *some* 
placement, that any placement algorithm will eventually find it?  If so, 
then I think this fallback may make sense, as there's nothing for the 
user to really do.
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -365,10 +365,11 @@ static void dump_numa(unsigned char key)
>
>          for_each_online_node(i) {
>                  paddr_t pa = (paddr_t)(NODE_DATA(i)->node_start_pfn + 1)<<  PAGE_SHIFT;
> -               printk("idx%d ->  NODE%d start->%lu size->%lu\n",
> +               printk("idx%d ->  NODE%d start->%lu size->%lu free->%lu\n",
>                            i, NODE_DATA(i)->node_id,
>                            NODE_DATA(i)->node_start_pfn,
> -                         NODE_DATA(i)->node_spanned_pages);
> +                         NODE_DATA(i)->node_spanned_pages,
> +                          avail_node_heap_pages(i));
>                  /* sanity check phys_to_nid() */
>                  printk("phys_to_nid(%"PRIpaddr") ->  %d should be %d\n", pa, phys_to_nid(pa),
>                            NODE_DATA(i)->node_id);
This should be in its own patch.

   -George

next prev parent reply	other threads:[~2012-05-01 15:45 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-11 13:17 [PATCH 00 of 10 [RFC]] Automatically place guest on host's NUMA nodes with xl Dario Faggioli
2012-04-11 13:17 ` [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map Dario Faggioli
2012-04-11 16:08   ` George Dunlap
2012-04-11 16:31     ` Dario Faggioli
2012-04-11 16:41       ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 02 of 10 [RFC]] libxl: Generalize libxl_cpumap to just libxl_map Dario Faggioli
2012-04-11 13:17 ` [PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap Dario Faggioli
2012-04-11 16:38   ` George Dunlap
2012-04-11 16:57     ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 04 of 10 [RFC]] libxl: Introduce libxl_get_numainfo() calling xc_numainfo() Dario Faggioli
2012-04-11 13:17 ` [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file Dario Faggioli
2012-04-12 10:24   ` George Dunlap
2012-04-12 10:48     ` David Vrabel
2012-04-12 22:25       ` Dario Faggioli
2012-04-12 11:32     ` Formatting of emails which are comments on patches Ian Jackson
2012-04-12 11:42       ` George Dunlap
2012-04-12 22:21     ` [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file Dario Faggioli
2012-04-11 13:17 ` [PATCH 06 of 10 [RFC]] xl: Allow user to set or change node affinity on-line Dario Faggioli
2012-04-12 10:29   ` George Dunlap
2012-04-12 21:57     ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity` Dario Faggioli
2012-04-12 23:06   ` Dario Faggioli
2012-04-27 14:45   ` George Dunlap
2012-05-02 15:13     ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes Dario Faggioli
2012-05-01 15:45   ` George Dunlap [this message]
2012-05-02 16:30     ` Dario Faggioli
2012-05-03  1:03       ` Dario Faggioli
2012-05-03  8:10         ` Ian Campbell
2012-05-03 10:16         ` George Dunlap
2012-05-03 13:41       ` George Dunlap
2012-05-03 14:58         ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 09 of 10 [RFC]] xl: Introduce Best and Worst Fit guest placement algorithms Dario Faggioli
2012-04-16 10:29   ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation Dario Faggioli
2012-04-12  9:11   ` Ian Campbell
2012-04-12 10:32     ` Dario Faggioli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FA0051F.6020909@eu.citrix.com \
    --to=george.dunlap@eu.citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=Stefano.Stabellini@eu.citrix.com \
    --cc=andre.przywara@amd.com \
    --cc=juergen.gross@ts.fujitsu.com \
    --cc=raistlin@linux.it \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).