From: Dario Faggioli <raistlin@linux.it>
To: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andre Przywara <andre.przywara@amd.com>,
Ian Campbell <Ian.Campbell@citrix.com>,
Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
Juergen Gross <juergen.gross@ts.fujitsu.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
Date: Thu, 03 May 2012 16:58:51 +0200 [thread overview]
Message-ID: <1336057131.2313.82.camel@Abyss> (raw)
In-Reply-To: <4FA28B1B.8040205@eu.citrix.com>
On Thu, 2012-05-03 at 14:41 +0100, George Dunlap wrote:
> >> Why are "nodes_policy" and "num_nodes_policy" not passed in along with
> >> b_info?
> > That was my first implementation. Then I figured out that I want to do
> > the placement in _xl_, not in _libxl_, so I really don't need to muck up
> > build info with placement related stuff. Should I use b_info anyway,
> > even if I don't need these fields while in libxl?
> Ah right -- yeah, probably since b_info is a libxl structure, you
> shouldn't add it in there. But in that case you should probably add
> another xl-specific structure and pass it through, rather than having
> global variables, I think. It's only used in the handful of placement
> functions, right?
>
It is, and yes, I can and will try to "pack" it better. Thanks.
> > - which percent should I try, and in what order? I mean, 75%/25%
> > sounds reasonable, but maybe also 80%/20% or even 60%/40% helps your
> > point.
> I had in mind no constraints at all on the ratios -- basically, if you
> can find N nodes such that the sum of free memory is enough to create
> the VM, even 99%/1%, then go for that rather than looking for N+1.
> Obviously finding a more balanced option would be better.
>
I see, and this looks both sensible and doable with reasonable
effort. :-)
> One option
> would be to scan through finding all sets of N nodes that will satisfy
> the criteria, and then choose the most "balanced" one. That might be
> more than we need for 4.2, so another option would be to look for evenly
> balanced nodes first, then if we don't find a set, look for any set.
> (That certainly fits with the "first fit" description!)
>
Yep, that would need some extra effort that we might want to postpone. I
like what you proposed above and I'll look into putting it together
without turning xl too much into a set-theory processing machine. ;-D
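Just to fix the idea, something like the sketch below is what I'm
thinking of. Mind that it is only a rough illustration with made-up
types and helper names (nothing from the actual patches): for each
candidate set size n, a first pass only accepts sets where every node
can take an even share of the memory, and a second pass falls back to
any set whose combined free memory is enough (the "first fit" part).

#include <stdbool.h>
#include <stdint.h>

struct nodeinfo {
    uint64_t free_mem;   /* free memory on this node */
};

/* Does the candidate set satisfy the (optionally balanced) memory need? */
static bool set_ok(const struct nodeinfo *ni, const int *set, int n,
                   uint64_t need, bool balanced)
{
    uint64_t total = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (balanced && ni[set[i]].free_mem < need / n)
            return false;
        total += ni[set[i]].free_mem;
    }
    return total >= need;
}

/* Enumerate all n-node subsets of [first, nr_nodes) and stop at the
 * first one that works. */
static bool pick_set(const struct nodeinfo *ni, int nr_nodes, int first,
                     int *set, int depth, int n, uint64_t need, bool balanced)
{
    int node;

    if (depth == n)
        return set_ok(ni, set, n, need, balanced);
    for (node = first; node < nr_nodes; node++) {
        set[depth] = node;
        if (pick_set(ni, nr_nodes, node + 1, set, depth + 1, n, need, balanced))
            return true;
    }
    return false;
}

/* Returns how many nodes we end up using, with the chosen nodes in set[];
 * 0 means no placement was found. */
static int place_guest(const struct nodeinfo *ni, int nr_nodes,
                       uint64_t need, int *set)
{
    int n;

    for (n = 1; n <= nr_nodes; n++) {
        if (pick_set(ni, nr_nodes, 0, set, 0, n, need, true))   /* balanced */
            return n;
        if (pick_set(ni, nr_nodes, 0, set, 0, n, need, false))  /* any fit */
            return n;
    }
    return 0;
}

place_guest() would then just be handed the guest's memory requirement
and give back the nodes to build the affinity from.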
> > - suppose I go for 75%/25%, what about the scheduling of the VM?
> Haha -- yeah, for a research paper, you'd probably implement some kind
> of lottery scheduling algorithm that would schedule it on one node 75%
> of the time and another node 25% of the time. :-)
>
That would be awesome. I know a bunch of people who would manage to
get at least 3 or 4 publications out of this idea (one about the
idea, one about the implementation, one about the statistical
guarantees, etc.). :-P :-P
> But I think that just
> making the node affinity equal to both of them will be good enough for
> now. There will be some variability in performance, but there will be
> some of that anyway depending on what node's memory the guest happens to
> use more.
>
Agreed. I'm really thinking about taking your suggestion about keeping
the layout "fluid" and then setting the affinity to all of the involved
nodes. Thanks again.
> This is actually kind of a different issue, but I'll bring it up now
> because it's related. (Something to toss in for thinking about in 4.3
> really.) Suppose there are 4 cores and 16GiB per node, and a VM has 8
> vcpus and 8GiB of RAM. The algorithm you have here will attempt to put
> 4GiB on each of two nodes (since it will take 2 nodes to get 8 cores).
> However, it's pretty common for larger VMs to have way more vcpus than
> they actually use at any one time. So it might actually have better
> performance to put all 8GiB on one node, and set the node affinity
> accordingly. In the rare event that more than 4 vcpus are active, a
> handful of vcpus will have all remote accesses, but the majority of the
> time, all of the cpus will have local accesses. (OTOH, maybe that
> should be only a policy thing that we should recommend in the
> documentation...)
>
That makes a lot of sense, and I was also thinking along the very same
lines. Once we have a proper mechanism to account for CPU load, we
could add another policy on top of this, i.e., once we've got all the
information about the memory, what should we do with respect to (v)cpus?
Should we spread the load or prefer some more "packed" placement? Again,
I think we can both add more tunables and also make things dynamic
(e.g., have all 8 vcpus on the 4 cpus of a node and, if they're
generating too many remote accesses, expand the node affinity and move
some of the memory) once we have more of the mechanism in place.
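Just to make the kind of tunable I have in mind concrete (again, purely
a sketch with invented names, not tied to the current patches): a
"packed" policy would target only as many nodes as the memory needs,
while a "spread" one would also insist on enough physical cores for all
the vcpus.

#include <stdint.h>

/* Entirely made-up knob for the policy being discussed. */
enum placement_policy { PLACE_PACKED, PLACE_SPREAD };

static int nodes_needed(uint64_t amount, uint64_t per_node)
{
    return (int)((amount + per_node - 1) / per_node);   /* ceil */
}

static int target_nr_nodes(int nr_vcpus, int cores_per_node,
                           uint64_t guest_mem, uint64_t free_mem_per_node,
                           enum placement_policy pol)
{
    int by_mem  = nodes_needed(guest_mem, free_mem_per_node);
    int by_cpus = nodes_needed(nr_vcpus, cores_per_node);

    if (pol == PLACE_PACKED)
        return by_mem;                          /* e.g. 8GiB on 16GiB nodes: 1 node */
    return by_cpus > by_mem ? by_cpus : by_mem; /* e.g. 8 vcpus on 4-core nodes: 2 nodes */
}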
> Dude, this is open source. Be opinionated. ;-)
>
Thanks for the advice... From now on, I'll do my best! :-)
> What do you think of my suggestions above?
>
As I said, I actually like them.
> I think if the user specifies a nodemap, and that nodemap doesn't have
> enough memory, we should throw an error.
>
Agreed.
> If there's a node_affinity set, no memory on that node, but memory on a
> *different* node, what will Xen do? It will allocate memory on some
> other node, right?
>
Yep.
> So ATM even if you specify a cpumask, you'll get memory on the masked
> nodes first, and then memory elsewhere (probably in a fairly random
> manner); but as much of the memory as possible will be on the masked
> nodes. I wonder then if we shouldn't just keep that behavior -- i.e.,
> if there's a cpumask specified, just return the nodemask from that mask,
> and let Xen put as much as possible on that node and let the rest fall
> where it may.
>
Well, it sounds reasonable after all, and I'm pretty sure it would
simplify the code a bit, which is not a bad thing at all. So yes, I'm
inclined to agree on this.
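In other words, something as simple as the following would do. The
bitmaps are modelled as plain 0/1 flag arrays here just for
illustration, standing in for the real libxl_cpumap/libxl_nodemap, and
cpu_to_node[] stands in for whatever cpu-to-node mapping we can get
out of the topology information.

#include <stdint.h>

static void nodemask_from_cpumask(const uint8_t *cpumask, int nr_cpus,
                                  const int *cpu_to_node, uint8_t *nodemask)
{
    int cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (cpumask[cpu])
            nodemask[cpu_to_node[cpu]] = 1;

    /* Xen will then allocate as much memory as it can from these nodes
     * and let the rest fall wherever it may. */
}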
> > However, if what you mean is I could check beforehand whether or not the
> > user provided configuration will give us enough CPUs and avoid testing
> > scenarios that are guaranteed to fail, then I agree and I'll reshape the
> > code to look like that. This triggers the heuristics re-designing stuff
> > from above again, as one has to decide what to do if the user asks for
> > "nodes=[1,3]" and I discover (earlier) that I need one more node to
> > have enough CPUs (I mean, which node should I try first?).
> No, that's not exactly what I meant. Suppose there are 4 cores per
> node, and a VM has 16 vcpus, and NUMA is just set to auto, with no other
> parameters. If I'm reading your code right, what it will do is first
> try to find a set of 1 node that will satisfy the constraints, then 2
> nodes, then 3 nodes, then 4, &c. Since there are at most 4 cores per
> node, we know that 1, 2, and 3 nodes are going to fail, regardless of
> how much memory there is or how many cpus are offline. So why not just
> start with 4, if the user hasn't specified anything? Then if 4 doesn't
> work (either because there's not enough memory, or some of the cpus are
> offline), then we can start bumping it up to 5, 6, &c.
>
Fine, I got it now, thanks for clarifying (both here and in the other
e-mail). I'll look into that and, if it doesn't cost too much
thinking/coding/debugging time, I'll go for it.
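For reference, the lower bound you're describing boils down to
something like this (just a sketch):

/* With numa=auto and no explicit user constraints, there is no point
 * in trying fewer nodes than it takes to cover the guest's vcpus with
 * physical cores. */
static int min_nodes_for_vcpus(int nr_vcpus, int cores_per_node)
{
    return (nr_vcpus + cores_per_node - 1) / cores_per_node;   /* ceil */
}

/* E.g. 16 vcpus on 4-core nodes: start the search directly at 4 nodes,
 * and only bump to 5, 6, &c. if memory or offline cpus get in the way. */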
> That's what I was getting at -- but again, if it makes it too
> complicated, trading a bit of extra passes for a significant chunk of
> your debugging time is OK. :-)
>
I'll be sure to do that. I really want this thing out ASAP.
> Well, if the user didn't specify anything, then we can't contradict
> anything he specified, right? :-) If the user doesn't specify anything,
> and the default is "numa=auto", then I think we're free to do whatever
> we think is best regarding NUMA placement; in fact, I think we should
> try to avoid failing VM creation if it's at all possible. I just meant
> what I think we should do if the user asked for specific NUMA nodes or a
> specific number of nodes.
>
I agree, and it should be easy enough to tell which of these two
situations we are in and act accordingly.
> It seems like we have a number of issues here that would be good for
> more people to come in on --
>
This is something I really would enjoy a lot!
> what if I attempt to summarize the
> high-level decisions we're talking about so that it's easier for more
> people to comment on them?
>
That would be very helpful, indeed.
Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)