From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
Date: Thu, 3 May 2012 14:41:47 +0100
Message-ID: <4FA28B1B.8040205@eu.citrix.com>
In-Reply-To: <1335976251.2961.123.camel@Abyss>

On 02/05/12 17:30, Dario Faggioli wrote:
>>> +
>>> +/* Store the policy for the domain while parsing */
>>> +static int nodes_policy = NODES_POLICY_DEFAULT;
>>> +
>>> +/* Store the number of nodes to be used while parsing */
>>> +static int num_nodes_policy = 0;
>> Why are "nodes_policy" and "num_nodes_policy" not passed in along with
>> b_info?
>>
> That was my first implementation. Then I figured out that I want to do
> the placement in _xl_, not in _libxl_, so I really don't need to muck up
> build info with placement related stuff. Should I use b_info anyway,
> even if I don't need these fields while in libxl?
Ah right -- yeah, since b_info is a libxl structure, you probably 
shouldn't add it in there.  But in that case you should probably add 
another xl-specific structure and pass it through, rather than having 
global variables, I think.  It's only used in the handful of placement 
functions, right?
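
Something along these lines is what I had in mind (just a sketch, and all 
the names are made up):

#include "libxl.h"

/* Hypothetical xl-only structure -- nothing in here goes into libxl's
 * b_info. */
typedef struct {
    int policy;      /* the NODES_POLICY_* value chosen while parsing */
    int num_nodes;   /* how many nodes the user asked for (0 = unspecified) */
} xl_numa_placement;

/* The placement functions take it explicitly... */
int place_domain(libxl_domain_build_info *b_info,
                 const xl_numa_placement *placement);

/* ...and the config parser fills a local instance and passes it down,
 * instead of updating file-scope globals. */
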
> Sounds definitely nicer. I just did it like that because I found a very
> similar example in xl itself, but I'm open to changing this to
> whatever you and the libxl maintainers reach a consensus on. :-)
Right.  This is always a bit tricky, balancing your own taste for how to 
do things, and following the style of the code that you're modifying.
>> Also, is it really necessary for a VM to have an equal amount of memory
>> on every node? It seems like it would be better to have 75% on one node
>> and 25% on a second node than to have 25% on four nodes, for example.
>> Furthermore, insisting on an even amount fragments the node memory
>> further -- i.e., if we chose to have 25% on four nodes instead of 75% on
>> one and 25% on another, that will make it harder for another VM to fit
>> on a single node as well.
>>
> Ok, that is something quite important to discuss. What you propose makes
> a lot of sense, although some issues come to mind:
>
> - which percentages should I try, and in what order? I mean, 75%/25%
>     sounds reasonable, but maybe 80%/20% or even 60%/40% would serve your
>     point just as well.
I had in mind no constraints at all on the ratios -- basically, if you 
can find N nodes such that the sum of free memory is enough to create 
the VM, even 99%/1%, then go for that rather than looking for N+1.  
Obviously finding a more balanced option would be better. One option 
would be to scan through finding all sets of N nodes that will satisfy 
the criteria, and then choose the most "balanced" one.  That might be 
more than we need for 4.2, so another option would be to look for evenly 
balanced nodes first, then if we don't find a set, look for any set.  
(That certainly fits with the "first fit" description!)
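
Something like this, say (find_node_set() is a made-up helper, just to 
illustrate the two-pass, balanced-first idea):

#include <stdbool.h>
#include <stdint.h>

/* Made-up helper: find a set of n nodes whose total free memory is at
 * least min_total, with every node contributing at least min_per_node. */
bool find_node_set(int n, uint64_t min_per_node, uint64_t min_total,
                   int *nodes /* out, n entries */);

static bool pick_nodes(uint64_t dom_mem, int n, int *nodes)
{
    /* Pass 1: prefer a balanced split -- every node in the set must be
     * able to take an even share (dom_mem / n) of the domain's memory. */
    if (find_node_set(n, dom_mem / n, dom_mem, nodes))
        return true;

    /* Pass 2: fall back to any set whose *total* free memory is enough,
     * however unevenly that ends up split (even 99%/1%). */
    if (find_node_set(n, 0, dom_mem, nodes))
        return true;

    return false;   /* the caller retries with n + 1 nodes */
}
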
> - suppose I go for 75%/25%, what about the scheduling of the VM?
Haha -- yeah, for a research paper, you'd probably implement some kind 
of lottery scheduling algorithm that would schedule it on one node 75% 
of the time and another node 25% of the time. :-)  But I think that just 
making the node affinity equal to both of them will be good enough for 
now.  There will be some variability in performance, but there will be 
some of that anyway depending on what node's memory the guest happens to 
use more.
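
("Equal to both of them" being nothing fancier than this sketch -- the 
cpumask type and both helpers are made up, just for illustration:)

#include <stdint.h>

typedef struct { uint64_t bits; } xl_cpumask_t;             /* made up */
xl_cpumask_t xl_cpus_of_node(int node);                     /* made up */
void xl_cpumask_union(xl_cpumask_t *dst, xl_cpumask_t src); /* made up */

/* Span every node the domain has memory on, with no 75%/25% weighting. */
static xl_cpumask_t affinity_for_nodes(const int *nodes, int nr_nodes)
{
    xl_cpumask_t affinity = { 0 };

    for (int i = 0; i < nr_nodes; i++)
        xl_cpumask_union(&affinity, xl_cpus_of_node(nodes[i]));
    return affinity;    /* pin the domain's vcpus to this */
}
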

This is actually kind of a different issue, but I'll bring it up now 
because it's related.  (Something to toss in for thinking about in 4.3 
really.)  Suppose there are 4 cores and 16GiB per node, and a VM has 8 
vcpus and 8GiB of RAM.  The algorithm you have here will attempt to put 
4GiB on each of two nodes (since it will take 2 nodes to get 8 cores).  
However, it's pretty common for larger VMs to have way more vcpus than 
they actually use at any one time.  So it might actually have better 
performance to put all 8GiB on one node, and set the node affinity 
accordingly.  In the rare event that more than 4 vcpus are active, a 
handful of vcpus will have all remote accesses, but the majority of the 
time, all of the cpus will have local accesses.  (OTOH, maybe that 
should be only a policy thing that we should recommend in the 
documentation...)

> Please, don't get me wrong, I see your point and really think it makes
> sense. I've actually thought along the same lines for a while, but then I
> couldn't find an answer to the questions above.
>
> That's why I kind of fell back to Xen's default "striped" approach
> (although on as few nodes as possible, which is _much_ better than Xen's
> actual default!). It looked simple enough to write, read and
> understand, while still providing statistically consistent performance.
Dude, this is open source.  Be opinionated. ;-)

What do you think of my suggestions above?
>> Hmm -- if I'm reading this right, the only time the nodemap won't be 
>> all nodes is if (1) the user specified nodes, or (2) there's a 
>> cpumask in effect. If we're going to override that setting, wouldn't 
>> it make sense to just expand to all numa nodes? 
> As you wish; the whole "what to do if what I've been provided with
> doesn't work" question is still at the *wild guess* stage, meaning I tried
> to figure out what would be best to do, but I might well be far from the
> actual correct solution, provided there is one.
>
> Trying to enlarge the nodemap step by step potentially yields better
> performance, but is probably not very close to the "least surprise"
> principle one should follow when designing UIs. :-(
>
>> Hmm -- though I suppose what you'd really want to try is adding each
>> node in turn, rather than one at a time (i.e., if the cpus are pinned to
>> nodes 2 and 3, and [2,3] doesn't work, try [1,2,3] and [2,3,4] before
>> trying [1,2,3,4]).
>>
> Yep, that makes a whole lot of sense, thanks! I can definitely try doing
> that, although it will complicate the code a bit...
>
>> But that's starting to get really complicated -- I
>> wonder if it's better to just fail and let the user change the pinnings
>> / configuration node mapping.
>>
> Well, that will probably be the least surprising behaviour.
>
> Again, just let me know what you think it's best among the various
> alternatives and I'll go for it.
I think if the user specifies a nodemap, and that nodemap doesn't have 
enough memory, we should throw an error.

If there's a node_affinity set, no memory on that node, but memory on a 
*different* node, what will Xen do?  It will allocate memory on some 
other node, right?

So ATM even if you specify a cpumask, you'll get memory on the masked 
nodes first, and then memory elsewhere (probably in a fairly random 
manner); but as much of the memory as possible will be on the masked 
nodes.  I wonder then if we shouldn't just keep that behavior -- i.e., 
if there's a cpumask specified, just return the nodemask from that mask, 
and let Xen put as much as possible on those nodes and let the rest fall 
where it may.
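
In code, that's something as dumb as this (node_of_cpu() is a made-up 
helper):

#include <stdbool.h>

int node_of_cpu(int cpu);   /* made up: which node a given cpu sits on */

/* Take the set of nodes the masked cpus live on as the nodemask, with no
 * retry / expansion logic at all: Xen will put as much memory as it can
 * there, and the rest falls where it may. */
static void nodemap_from_cpumask(const bool *cpu_in_mask, int nr_cpus,
                                 bool *node_in_map)
{
    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (cpu_in_mask[cpu])
            node_in_map[node_of_cpu(cpu)] = true;
}
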

What do you think?
>>> +
>>> +        if (use_cpus>= b_info->max_vcpus) {
>>> +            rc = 0;
>>> +            break;
>>> +        }
>> Hmm -- there's got to be a better way to find out the minimum number of
>> nodes to house a given number of vcpus than just starting at 1 and
>> re-trying until we have enough.
>>> +        /* Add one more node and retry fitting the domain */
>>> +        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
>> Same comment as above.
>>
> I'm not sure I'm getting this. The whole point here is to consider
> free memory on the various nodes first, and then adjust the result if
> some other constraint is being violated.
Sorry, that "above" was wrong -- I meant the other comment, about 
__add_nodes_to_nodemap(). :-)
> However, if what you mean is that I could check beforehand whether or not
> the user-provided configuration will give us enough CPUs, and avoid testing
> scenarios that are guaranteed to fail, then I agree and I'll reshape the
> code to look like that. This triggers the heuristics re-designing stuff
> from above again, as one has to decide what to do if the user asks for
> "nodes=[1,3]" and I discover (earlier) that I need one more node to
> have enough CPUs (I mean, what node should I try first?).
No, that's not exactly what I meant.  Suppose there are 4 cores per 
node, and a VM has 16 vcpus, and NUMA is just set to auto, with no other 
parameters.  If I'm reading your code right, what it will do is first 
try to find a set of 1 node that will satisfy the constraints, then 2 
nodes, then 3 nodes, then 4, &c.  Since there are at most 4 cores per 
node, we know that 1, 2, and 3 nodes are going to fail, regardless of 
how much memory there is or how many cpus are offline.  So why not just 
start with 4, if the user hasn't specified anything?  Then if 4 doesn't 
work (either because there's not enough memory, or some of the cpus are 
offline), then we can start bumping it up to 5, 6, &c.
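
E.g. (just a sketch: cores_per_node would come from the NUMA info, and 
try_fit_on_n_nodes() is made up):

#include <stdbool.h>

bool try_fit_on_n_nodes(int n);   /* made up: does the domain fit on n nodes? */

static int smallest_fitting_node_count(int max_vcpus, int cores_per_node,
                                       int nr_nodes)
{
    /* Don't start the search at 1 node: start at the smallest number of
     * nodes that could possibly hold max_vcpus, and only bump it up. */
    int min_nodes = (max_vcpus + cores_per_node - 1) / cores_per_node;

    for (int n = min_nodes; n <= nr_nodes; n++)
        if (try_fit_on_n_nodes(n))
            return n;
    return -1;   /* nothing fits */
}

With 16 vcpus and 4 cores per node that starts straight at 4, skipping the 
three sizes we already know can't work.
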

That's what I was getting at -- but again, if it makes it too 
complicated, trading a few extra passes for a significant chunk of 
your debugging time is OK. :-)

> So, I'm not entirely sure I answered your question, but the point is your
> idea above is the best one: if you ask for something and we don't manage to
> get it done, we just stop and let you figure things out.
> I have only one question about this approach: what if the automatic
> placement is/becomes the default? I mean, avoiding any kind of fallback
> (which, again, makes sense to me when the user is explicitly asking for
> something specific) would mean a completely NUMA-unaware VM creation can
> be aborted even if the user did not say anything... How do we deal with
> this?
Well, if the user didn't specify anything, then we can't contradict 
anything he specified, right? :-)  If the user doesn't specify anything, 
and the default is "numa=auto", then I think we're free to do whatever 
we think is best regarding NUMA placement; in fact, I think we should 
try to avoid failing VM creation if it's at all possible.  I just meant 
what I think we should do if the user asked for specific NUMA nodes or a 
specific number of nodes.  (I think that cpu masks should probably 
behave as they do now -- set the numa_affinity, but not fail domain 
creation if there's not enough memory on those nodes.)

It seems like we have a number of issues here that would be good for 
more people to come in on -- what if I attempt to summarize the 
high-level decisions we're talking about so that it's easier for more 
people to comment on them?

  -George

>
>>> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
>>> --- a/xen/arch/x86/numa.c
>>> +++ b/xen/arch/x86/numa.c
>>> ...
>>>
>> This should be in its own patch.
>>
> Ok.
>
> Thanks a lot again for taking a look!
>
> Regards,
> Dario
>
