From: George Dunlap
Subject: Re: [PATCH 10 of 10 v3] Some automatic NUMA placement documentation
Date: Fri, 6 Jul 2012 15:08:35 +0100
Message-ID: <4FF6F163.6010300@eu.citrix.com>
To: Dario Faggioli
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, Juergen Gross,
    Ian Jackson, "xen-devel@lists.xen.org", Roger Pau Monne

On 04/07/12 17:18, Dario Faggioli wrote:
> # HG changeset patch
> # User Dario Faggioli
> # Date 1341416324 -7200
> # Node ID f1523c3dc63746e07b11fada5be3d461c3807256
> # Parent 885e2f385601d66179058bfb6bd3960f17d5e068
> Some automatic NUMA placement documentation
>
> About rationale, usage and (some small bits of) API.
>
> Signed-off-by: Dario Faggioli
> Acked-by: Ian Campbell
>
> Changes from v1:
>  * API documentation moved close to the actual functions.
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> new file mode 100644
> --- /dev/null
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -0,0 +1,91 @@
> +# Guest Automatic NUMA Placement in libxl and xl #
> +
> +## Rationale ##
> +
> +NUMA means the memory accessing times of a program running on a CPU depends on
> +the relative distance between that CPU and that memory. In fact, most of the
> +NUMA systems are built in such a way that each processor has its local memory,
> +on which it can operate very fast. On the other hand, getting and storing data
> +from and on remote memory (that is, memory local to some other processor) is
> +quite more complex and slow. On these machines, a NUMA node is usually defined
> +as a set of processor cores (typically a physical CPU package) and the memory
> +directly attached to the set of cores.
> +
> +The Xen hypervisor deals with Non-Uniform Memory Access (NUMA]) machines by
> +assigning to its domain a "node affinity", i.e., a set of NUMA nodes of the
> +host from which it gets its memory allocated.
> +
> +NUMA awareness becomes very important as soon as many domains start running
> +memory-intensive workloads on a shared host. In fact, the cost of accessing non
> +node-local memory locations is very high, and the performance degradation is
> +likely to be noticeable.
> +
> +## Guest Placement in xl ##
> +
> +If using xl for creating and managing guests, it is very easy to ask for both
> +manual or automatic placement of them across the host's NUMA nodes.
> +
> +Note that xm/xend does the very same thing, the only differences residing in
> +the details of the heuristics adopted for the placement (see below).
> +
> +### Manual Guest Placement with xl ###
> +
> +Thanks to the "cpus=" option, it is possible to specify where a domain should
> +be created and scheduled on, directly in its config file. This affects NUMA
> +placement and memory accesses as the hypervisor constructs the node affinity of
> +a VM basing right on its CPU affinity when it is created.
> +
> +This is very simple and effective, but requires the user/system administrator
> +to explicitly specify affinities for each and every domain, or Xen won't be
> +able to guarantee the locality for their memory accesses.
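
One small suggestion for this section: a concrete config fragment would make
the "cpus=" mechanism immediately obvious.  Something along these lines (just
an illustrative sketch -- the name and the numbers are made up):

    # Hypothetical xl domain config fragment
    name   = "numa-guest"
    memory = 2048
    vcpus  = 2
    # Pin the domain's vCPUs to pCPUs 0-3; at creation time the domain's
    # node affinity (and thus where its memory is allocated) is derived
    # from this cpu affinity.
    cpus   = "0-3"

That makes it clear that the node affinity follows from the cpu affinity
given here, rather than being specified directly.
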
> +
> +It is also possible to deal with NUMA by partitioning the system using cpupools
> +(available in the upcoming release of Xen, 4.2). Again, this could be "The
> +Right Answer" for many needs and occasions, but has to to be carefully
> +considered and manually setup by hand.
> +
> +### Automatic Guest Placement with xl ###
> +
> +In case no "cpus=" option is specified in the config file, libxl tries to
I think "If no 'cpus=' option..." is better here.

> +figure out on its own on which node(s) the domain could fit best. It is
> +worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of an
> +host host is an incarnation of the Bin Packing Problem. In fact, the various
host host

> +VMs with different memory sizes are the items to be packed, and the host nodes
> +are the bins. That is known to be NP-hard, thus, it is probably better to
> +tackle the problem with some sort of hauristics, as we do not have any oracle
> +available!
I think you can just say "...is an incarnation of the Bin Packing Problem,
which is known to be NP-hard.  We will therefore be using some heuristics."
(nb the spelling of "heuristics" as well.)

> +
> +The first thing to do is finding a node, or even a set of nodes, that have
> +enough free memory and enough physical CPUs for accommodating the one new
> +domain. The idea is to find a spot for the domain with at least as much free
> +memory as it has configured, and as much pCPUs as it has vCPUs. After that,
> +the actual decision on which solution to go for happens accordingly to the
> +following heuristics:
> +
> + * candidates involving fewer nodes come first. In case two (or more)
> +   candidates span the same number of nodes,
> + * the amount of free memory and the number of domains assigned to the
> +   candidates are considered. In doing that, candidates with greater amount
> +   of free memory and fewer assigned domains are preferred, with free memory
> +   "weighting" three times as much as number of domains.
> +
> +Giving preference to small candidates ensures better performance for the guest,
I think I would say "candidates with fewer nodes" here; "small candidates"
doesn't convey "fewer nodes" to me.

> +as it avoid spreading its memory among different nodes. Favouring the nodes
> +that have the biggest amounts of free memory helps keeping the memory
We normally don't say "big amount", but "large amount" (don't ask me why --
just sounds a bit funny to me).  So this would be "largest amount".

> +fragmentation small, from a system wide perspective. However, in case more
Again, s/in case/if/

Other than that, looks good to me.

 -George

> +candidates fulfil these criteria by roughly the same extent, having the number
> +of domains the candidates are "hosting" helps balancing the load on the various
> +nodes.
> +
> +## Guest Placement within libxl ##
> +
> +xl achieves automatic NUMA just because libxl does it interrnally.
> +No API is provided (yet) for interacting with this feature and modify
> +the library behaviour regarding automatic placement, it just happens
> +by default if no affinity is specified (as it is with xm/xend).
> +
> +For actually looking and maybe tweaking the mechanism and the algorithms it
> +uses, all is implemented as a set of libxl internal interfaces and facilities.
> +Look at the comment "Automatic NUMA placement" in libxl\_internal.h.
> +
> +Note this may change in future versions of Xen/libxl.
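
One last, entirely optional, thought: since the two-step comparison in the
heuristics list above is the bit readers are most likely to puzzle over, the
document could illustrate it with a small sketch.  Something like the
following -- purely illustrative, not the actual libxl code, and all the
struct/field names are made up:

    /* Illustrative sketch only -- NOT the real libxl implementation. */
    #include <stdint.h>

    struct candidate {
        int      nr_nodes;    /* NUMA nodes the candidate spans          */
        uint64_t free_memkb;  /* free memory on those nodes              */
        int      nr_domains;  /* domains already running on those nodes  */
    };

    /* Scale a value into [0,1] against the larger of the two candidates,
     * so that memory and domain counts can be combined sensibly. */
    static double normalise(double v, double max)
    {
        return max > 0 ? v / max : 0.0;
    }

    /* Return <0 if c1 is the better candidate, >0 if c2 is, 0 if tied. */
    static int candidate_cmp(const struct candidate *c1,
                             const struct candidate *c2)
    {
        /* 1. Candidates spanning fewer nodes always come first. */
        if (c1->nr_nodes != c2->nr_nodes)
            return c1->nr_nodes - c2->nr_nodes;

        /* 2. Tie-break on free memory (weight 3) and number of hosted
         *    domains (weight 1): more memory and fewer domains win.   */
        double max_mem  = c1->free_memkb > c2->free_memkb ? c1->free_memkb
                                                          : c2->free_memkb;
        double max_doms = c1->nr_domains > c2->nr_domains ? c1->nr_domains
                                                          : c2->nr_domains;
        double s1 = 3.0 * normalise(c1->free_memkb, max_mem)
                        - normalise(c1->nr_domains, max_doms);
        double s2 = 3.0 * normalise(c2->free_memkb, max_mem)
                        - normalise(c2->nr_domains, max_doms);

        return s1 > s2 ? -1 : (s1 < s2 ? 1 : 0);
    }

No need to block the series on that, of course.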