From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross <juergen.gross@ts.fujitsu.com>
Subject: Re: [PATCH 11 of 11 v3] docs: rearrange and update NUMA
 placement documentation
Date: Fri, 01 Feb 2013 14:41:47 +0100
Message-ID: <510BC61B.6090300@ts.fujitsu.com>
References: <patchbomb.1359716470@Solace>
	<f05ac3f656309bc71d43.1359716481@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <f05ac3f656309bc71d43.1359716481@Solace>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>, Dan Magenheimer <dan.magenheimer@oracle.com>, Ian Campbell <Ian.Campbell@citrix.com>, Anil Madhavapeddy <anil@recoil.org>, George Dunlap <george.dunlap@eu.citrix.com>, Andrew Cooper <Andrew.Cooper3@citrix.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, xen-devel@lists.xen.org, Jan Beulich <JBeulich@suse.com>, Daniel De Graaf <dgdegra@tycho.nsa.gov>, Matt Wilson <msw@amazon.com>
List-Id: xen-devel@lists.xenproject.org

On 01.02.2013 12:01, Dario Faggioli wrote:
> To include the new concept of NUMA aware scheduling and
> describe its impact.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -14,22 +14,67 @@ the memory directly attached to the set
>
>   The Xen hypervisor deals with NUMA machines by assigning to each domain
>   a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated.
> +get their memory allocated. Also, even if the node affinity of a domain
> +is allowed to change on-line, it is very important to "place" the domain
> +correctly when it is first created, as most of its memory is allocated
> +at that time and cannot (for now) be moved easily.
>
>   NUMA awareness becomes very important as soon as many domains start
>   running memory-intensive workloads on a shared host. In fact, the cost
>   of accessing non node-local memory locations is very high, and the
>   performance degradation is likely to be noticeable.
>
> -## Guest Placement in xl ##
> +For more information, have a look at the [Xen NUMA Introduction][numa_intro]
> +page on the Wiki.
> +
> +### Placing via pinning and cpupools ###
> +
> +The simplest way of placing a domain on a NUMA node is statically pinning
> +the domain's vCPUs to the pCPUs of the node. This goes under the name of
> +CPU affinity and can be set through the "cpus=" option in the config file
> +(more about this below). Another option is to pool together the pCPUs
> +spanning the node and put the domain in such a cpupool with the "pool="
> +config option (as documented in our [Wiki][cpupools_howto]).
> +
> +In both the above cases, the domain will not be able to execute outside
> +the specified set of pCPUs for any reason, even if all those pCPUs are
> +busy doing something else while there are other, idle pCPUs.
> +
> +So, when doing this, local memory accesses are 100% guaranteed, but that
> +may come at the cost of some load imbalances.
> +
> +### NUMA aware scheduling ###
> +
> +If the credit scheduler is in use, the concept of node affinity defined
> +above does not only apply to memory. In fact, starting from Xen 4.3, the
> +scheduler always tries to run the domain's vCPUs on one of the nodes in
> +its node affinity. Only if that turns out to be impossible will it
> +pick any free pCPU.
> +
> +This is, therefore, something more flexible than CPU affinity, as a domain
> +can still run everywhere; it just prefers some nodes over others.
> +Locality of access is less guaranteed than in the pinning case, but that
> +comes along with better chances to exploit all the host resources (e.g.,
> +the pCPUs).
> +
> +In fact, if all the pCPUs in a domain's node affinity are busy, it is
> +possible for the domain to run outside of it, but it is very likely that
> +slower execution (due to remote memory accesses) is still better than no
> +execution at all, as would happen with pinning. For this reason, NUMA
> +aware scheduling has the potential of bringing substantial performance
> +benefits, although this will depend on the workload.
> +
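
For illustration only (not part of the patch): a minimal Python sketch of
the difference between pinning and the node affinity preference described
above. It is a toy model, not the credit scheduler's code; the function
name `pick_pcpu` and its parameters are made up for this example.

```python
def pick_pcpu(free_pcpus, node_of, node_affinity=None, pinned_to=None):
    """Return a pCPU for a vCPU to run on, or None if it has to wait.

    free_pcpus:    set of currently idle pCPU ids
    node_of:       dict mapping pCPU id -> NUMA node id
    node_affinity: set of preferred nodes (soft preference)
    pinned_to:     set of pCPUs the vCPU is pinned to (hard constraint)
    """
    if pinned_to is not None:
        # Pinning: only the pinned pCPUs are ever eligible, so the vCPU
        # waits even if other pCPUs are idle.
        candidates = [p for p in free_pcpus if p in pinned_to]
        return min(candidates) if candidates else None

    if node_affinity:
        # Node affinity: try the preferred nodes first...
        preferred = [p for p in free_pcpus if node_of[p] in node_affinity]
        if preferred:
            return min(preferred)

    # ...and fall back to any idle pCPU if none of them is free.
    return min(free_pcpus) if free_pcpus else None
```

With pCPUs 0-1 on node 0 and 2-3 on node 1, and only pCPUs 2 and 3 idle, a
vCPU pinned to {0, 1} has to wait, while a vCPU with node affinity {0} runs
on pCPU 2 (with remote memory accesses).
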
> +## Guest placement in xl ##
>
>   If using xl for creating and managing guests, it is very easy to ask for
>   both manual or automatic placement of them across the host's NUMA nodes.
>
> -Note that xm/xend does the very same thing, the only differences residing
> -in the details of the heuristics adopted for the placement (see below).
> +Note that xm/xend does a very similar thing, the only differences being
> +the details of the heuristics adopted for automatic placement (see below),
> +and the lack of support (in both xm/xend and the Xen versions where that
> +was the default toolstack) for NUMA aware scheduling.
>
> -### Manual Guest Placement with xl ###
> +### Placing the guest manually ###
>
>   Thanks to the "cpus=" option, it is possible to specify where a domain
>   should be created and scheduled on, directly in its config file. This
> @@ -41,14 +86,19 @@ This is very simple and effective, but r
>   administrator to explicitly specify affinities for each and every domain,
>   or Xen won't be able to guarantee the locality for their memory accesses.
>
> -It is also possible to deal with NUMA by partitioning the system using
> -cpupools. Again, this could be "The Right Answer" for many needs and
> -occasions, but has to be carefully considered and setup by hand.
> +Notice that this also pins the domain's vCPUs to the specified set of
> +pCPUs, so it not only sets the domain's node affinity (its memory will
> +come from the nodes to which the pCPUs belong), but at the same time
> +forces the vCPUs of the domain to be scheduled on those same pCPUs.
>
> -### Automatic Guest Placement with xl ###
> +### Placing the guest automatically ###
>
>   If no "cpus=" option is specified in the config file, libxl tries
>   to figure out on its own on which node(s) the domain could fit best.
> +If it finds one (some), the domain's node affinity is set to it (them),
> +and both memory allocations and NUMA aware scheduling (for the credit
> +scheduler and starting from Xen 4.3) will comply with it.
> +
>   It is worthwhile noting that optimally fitting a set of VMs on the NUMA
>   nodes of an host is an incarnation of the Bin Packing Problem. In fact,
>   the various VMs with different memory sizes are the items to be packed,
> @@ -81,7 +131,7 @@ largest amounts of free memory helps kee
>   small, and maximizes the probability of being able to put more domains
>   there.
>
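
For illustration only: a rough Python sketch of the kind of search the text
describes, namely trying candidate sets of nodes and preferring those with
the most free memory. This is a simplification, not libxl's actual
heuristic. The function name `place`, the preference for fewer nodes, and
the `max_nodes` parameter (modelled on the 16-node limit mentioned under
"Limitations") are assumptions of this sketch.

```python
from itertools import combinations


def place(free_mem, needed, max_nodes=16):
    """Pick a set of NUMA nodes for a domain needing `needed` MB.

    free_mem: dict mapping node id -> free memory in MB
    Returns a tuple of node ids, or None if no placement is found.
    """
    if len(free_mem) > max_nodes:
        # The number of candidates grows exponentially with the number
        # of nodes, so give up on large hosts.
        return None

    nodes = sorted(free_mem)
    # Assumption: candidates spanning fewer nodes are tried first.
    for size in range(1, len(nodes) + 1):
        fitting = [
            c for c in combinations(nodes, size)
            if sum(free_mem[n] for n in c) >= needed
        ]
        if fitting:
            # Among candidates of equal size, prefer the most free memory.
            return max(fitting, key=lambda c: sum(free_mem[n] for n in c))
    return None
```

For example, with `free_mem = {0: 4096, 1: 8192, 2: 2048}`, a 3000 MB
domain lands on node 1 alone, while a 10000 MB domain needs two nodes and
gets nodes 0 and 1.
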
> -## Guest Placement within libxl ##
> +## Guest placement in libxl ##
>
>   xl achieves automatic NUMA placement because that is what libxl does
>   by default. No API is provided (yet) for modifying the behaviour of
> @@ -93,15 +143,34 @@ any placement from happening:
>       libxl_defbool_set(&domain_build_info->numa_placement, false);
>
>   Also, if `numa_placement` is set to `true`, the domain must not
> -have any cpu affinity (i.e., `domain_build_info->cpumap` must
> +have any CPU affinity (i.e., `domain_build_info->cpumap` must
>   have all its bits set, as it is by default), or domain creation
>   will fail returning `ERROR_INVAL`.
>
> +Starting from Xen 4.3, in case automatic placement happens (and is
> +successful), it will affect the domain's node affinity and _not_ its
> +CPU affinity. Namely, the domain's vCPUs will not be pinned to any
> +pCPU on the host, but the memory for the domain will come from the
> +selected node(s) and the NUMA aware scheduling (if the credit scheduler
> +is in use) will try to keep the domain there as much as possible.
> +
>   Besides than that, looking and/or tweaking the placement algorithm
>   search "Automatic NUMA placement" in libxl\_internal.h.
>
>   Note this may change in future versions of Xen/libxl.
>
> +## Xen < 4.3 ##
> +
> +As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
> +bit different for earlier versions of Xen. If no "cpus=" option is specified
> +and Xen 4.2 is in use, the automatic placement algorithm still runs, but
> +the result is used to _pin_ the vCPUs of the domain to the output node(s).
> +This is consistent with what was happening with xm/xend, which also
> +affected the domain's CPU affinity.
> +
> +On a version of Xen earlier than 4.2, there is no automatic placement at
> +all in xl or libxl, and hence no node or CPU affinity is affected.
> +
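
For illustration only: the version-dependent behaviour described in this
section, for the case where no "cpus=" option is given, can be summarised
by a small Python sketch. The function name and the returned labels are
made up for this example.

```python
def automatic_placement_effect(xen_version):
    """Describe what automatic placement affects for a Xen version.

    xen_version: (major, minor) tuple, e.g. (4, 2)
    """
    if xen_version < (4, 2):
        # No automatic placement in xl or libxl at all.
        return "none"
    if xen_version < (4, 3):
        # Placement runs, and its result pins the domain's vCPUs.
        return "cpu-affinity"
    # From 4.3 on, only the node affinity is set; vCPUs are not pinned.
    return "node-affinity"
```
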
>   ## Limitations ##
>
>   Analyzing various possible placement solutions is what makes the
> @@ -109,3 +178,6 @@ algorithm flexible and quite effective.
>   it won't scale well to systems with arbitrary number of nodes.
>   For this reason, automatic placement is disabled (with a warning)
>   if it is requested on a host with more than 16 NUMA nodes.
> +
> +[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
> +[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto
>


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html