From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
Date: Tue, 5 Nov 2013 16:20:48 +0000
Message-ID: <52791AE0.2000102@eu.citrix.com>
References: <20131105142844.30446.78671.stgit@Solace> <20131105143519.30446.9826.stgit@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20131105143519.30446.9826.stgit@Solace>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: Marcus Granado, Keir Fraser, Ian Campbell, Li Yechen, Andrew Cooper,
 Juergen Gross, Ian Jackson, xen-devel@lists.xen.org, Jan Beulich,
 Justin Weaver, Daniel De Graaf, Matt Wilson, Elena Ufimtseva
List-Id: xen-devel@lists.xenproject.org

On 11/05/2013 02:35 PM, Dario Faggioli wrote:
> instead of relying on the domain-wide node-affinity. To achieve
> that, we use the specific vcpu's node-affinity when computing
> the NUMA load balancing mask in csched_balance_cpumask().
>
> As a side effect, this simplifies the code quite a bit. In
> fact, prior to this change, we needed to cache the translation
> of d->node_affinity (which is a nodemask_t) to a cpumask_t,
> since that is what is needed during the actual scheduling
> (we used to keep it in node_affinity_cpumask).
>
> Since the per-vcpu node-affinity is maintained in a cpumask_t
> field (v->node_affinity) already, we don't need that complicated
> updating logic in place any longer, and this change hence can
> remove stuff like sched_set_node_affinity(),
> csched_set_node_affinity() and, of course, node_affinity_cpumask
> from csched_dom.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is updated too, to match
> the new behavior. While at it, an attempt is made to make the
> document as unambiguous as possible, with respect to the
> concepts of vCPU pinning, domain node-affinity and vCPU
> node-affinity.
>
> Signed-off-by: Dario Faggioli
> ---
>  docs/misc/xl-numa-placement.markdown |  124 +++++++++++++++++++++++-----------
>  xen/common/domain.c                  |    2 -
>  xen/common/sched_credit.c            |   62 ++---
>  xen/common/schedule.c                |    5 -
>  xen/include/xen/sched-if.h           |    2 -
>  xen/include/xen/sched.h              |    1
>  6 files changed, 94 insertions(+), 102 deletions(-)
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> index caa3fec..890b856 100644
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually
> defined as a set of processor cores (typically a physical CPU package) and
> the memory directly attached to the set of cores.
>
> -The Xen hypervisor deals with NUMA machines by assigning to each domain
> -a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated. Also, even if the node affinity of a domain
> -is allowed to change on-line, it is very important to "place" the domain
> -correctly when it is fist created, as the most of its memory is allocated
> -at that time and can not (for now) be moved easily.
> -
> NUMA awareness becomes very important as soon as many domains start
> running memory-intensive workloads on a shared host. In fact, the cost
> of accessing non node-local memory locations is very high, and the
> @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable.
>
> For more information, have a look at the [Xen NUMA Introduction][numa_intro]
> page on the Wiki.
>
> +## Xen and NUMA machines: the concept of _node-affinity_ ##
> +
> +The Xen hypervisor deals with NUMA machines through the concept of
> +_node-affinity_. When talking about node-affinity, it is important to
> +distinguish two different situations. The node-affinity *of a domain* is
> +the set of NUMA nodes of the host where the memory for the domain is
> +being allocated (mostly, at domain creation time). The node-affinity *of
> +a vCPU* is the set of NUMA nodes of the host on which the vCPU prefers
> +to run, and the (credit1 only, for now) scheduler will try to accomplish
> +that, whenever it is possible.
> +
> +Of course, despite the fact that they belong to and affect different
> +subsystems, domain and vCPU node-affinities are related. In fact, the
> +node-affinity of a domain is the union of the node-affinities of all the
> +domain's vCPUs.
> +
> +The above means that, when changing the vCPU node-affinity, the domain
> +node-affinity also changes. Well, although this is allowed to happen
> +on-line (i.e., when a domain is already running), that will not result
> +in the memory that has already been allocated being moved to a different
> +host NUMA node. This is why it is very important to "place" the domain
> +correctly when it is first created, as most of its memory is allocated
> +at that time and can not (for now) be moved easily.
> +
> ### Placing via pinning and cpupools ###
>
> The simplest way of placing a domain on a NUMA node is statically pinning
> -the domain's vCPUs to the pCPUs of the node. This goes under the name of
> -CPU affinity and can be set through the "cpus=" option in the config file
> -(more about this below). Another option is to pool together the pCPUs
> -spanning the node and put the domain in such a cpupool with the "pool="
> +the domain's vCPUs to the pCPUs of the node. This also goes under the name
> +of vCPU-affinity and can be set through the "cpus=" option in the config
> +file (more about this below). Another option is to pool together the pCPUs
> +spanning the node and put the domain in such a _cpupool_ with the "pool="
> config option (as documented in our [Wiki][cpupools_howto]).
>
> In both the above cases, the domain will not be able to execute outside
> @@ -45,24 +62,45 @@ may come at he cost of some load imbalances.
>
> ### NUMA aware scheduling ###
>
> -If the credit scheduler is in use, the concept of node affinity defined
> -above does not only apply to memory. In fact, starting from Xen 4.3, the
> -scheduler always tries to run the domain's vCPUs on one of the nodes in
> -its node affinity. Only if that turns out to be impossible, it will just
> -pick any free pCPU.
> +If using the credit scheduler, and starting from Xen 4.3, the scheduler always
> +tries to run the domain's vCPUs on one of the nodes in their node-affinity.
> +Only if that turns out to be impossible will it just pick any free pCPU.
> +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity,
> +potentially different from the ones of all the other vCPUs of the domain.
>
> -This is, therefore, something more flexible than CPU affinity, as a domain
> -can still run everywhere, it just prefers some nodes rather than others.
> +This is, therefore, something more flexible than vCPU pinning, as vCPUs
> +can still run everywhere, they just prefer some nodes rather than others.
> Locality of access is less guaranteed than in the pinning case, but that
> comes along with better chances to exploit all the host resources (e.g.,
> the pCPUs).
>
> -In fact, if all the pCPUs in a domain's node affinity are busy, it is
> -possible for the domain to run outside of there, but it is very likely that
> -slower execution (due to remote memory accesses) is still better than no
> -execution at all, as it would happen with pinning. For this reason, NUMA
> -aware scheduling has the potential of bringing substantial performances
> -benefits, although this will depend on the workload.
> +In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is
> +possible for the domain to run outside of there. The idea is that
> +slower execution (due to remote memory accesses) is still better than
> +no execution at all (as it would happen with pinning). For this reason,
> +NUMA aware scheduling has the potential of bringing substantial
> +performance benefits, although this will depend on the workload.
> +
> +Notice that, for each vCPU, the following three scenarios are possible:
> +
> + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU
> +   node-affinity. In this case, the vCPU is always scheduled on one
> +   of the pCPUs to which it is pinned, without any specific preference
> +   as to which one of them. Internally, the vCPU's node-affinity is just
> +   automatically computed from the vCPU pinning, and the scheduler
> +   just ignores it;
> + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to
> +   any particular pCPU. In this case, the vCPU can run on every pCPU.
> +   Nevertheless, the scheduler will try to have it running on one of
> +   the pCPUs of the node(s) with which it has node-affinity;
> + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to
> +   some pCPUs. In this case, the vCPU is always scheduled on one of the
> +   pCPUs to which it is pinned, with, among them, a preference for the
> +   ones that are from the node(s) it has node-affinity with. In case
> +   pinning and node-affinity form two disjoint sets of pCPUs, pinning
> +   "wins", and the node-affinity, although it is still used to derive
> +   the domain's node-affinity (for memory allocation), is, from the
> +   scheduler's perspective, just ignored.
>
> ## Guest placement in xl ##
>
> @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes.
>
> Note that xm/xend does a very similar thing, the only differences being
> the details of the heuristics adopted for automatic placement (see below),
> -and the lack of support (in both xm/xend and the Xen versions where that\
> +and the lack of support (in both xm/xend and the Xen versions where that
> was the default toolstack) for NUMA aware scheduling.
>
> ### Placing the guest manually ###
>
> Thanks to the "cpus=" option, it is possible to specify where a domain
> should be created and scheduled on, directly in its config file. This
> -affects NUMA placement and memory accesses as the hypervisor constructs
> -the node affinity of a VM basing right on its CPU affinity when it is
> -created.
> +affects NUMA placement and memory accesses as, in this case, the hypervisor
> +constructs the node-affinity of a VM based on its vCPU pinning
> +when it is created.
>
> This is very simple and effective, but requires the user/system
> -administrator to explicitly specify affinities for each and every domain,
> +administrator to explicitly specify the pinning of each and every domain,
> or Xen won't be able to guarantee the locality for their memory accesses.
>
> -Notice that this also pins the domain's vCPUs to the specified set of
> -pCPUs, so it not only sets the domain's node affinity (its memory will
> -come from the nodes to which the pCPUs belong), but at the same time
> -forces the vCPUs of the domain to be scheduled on those same pCPUs.
> +That, of course, also means the vCPUs of the domain will only be able to
> +execute on those same pCPUs.
>
> ### Placing the guest automatically ###
>
> @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries
> to figure out on its own on which node(s) the domain could fit best.
> If it finds one (some), the domain's node affinity get set to there,
> and both memory allocations and NUMA aware scheduling (for the credit
> -scheduler and starting from Xen 4.3) will comply with it.
> +scheduler and starting from Xen 4.3) will comply with it. Starting from
> +Xen 4.4, this just means all the vCPUs of the domain will have the same
> +vCPU node-affinity, which is the outcome of such automatic "fitting"
> +procedure.
>
> It is worthwhile noting that optimally fitting a set of VMs on the NUMA
> nodes of an host is an incarnation of the Bin Packing Problem. In fact,
> @@ -143,22 +182,29 @@ any placement from happening:
>
>     libxl_defbool_set(&domain_build_info->numa_placement, false);
>
> Also, if `numa_placement` is set to `true`, the domain must not
> -have any CPU affinity (i.e., `domain_build_info->cpumap` must
> -have all its bits set, as it is by default), or domain creation
> -will fail returning `ERROR_INVAL`.
> +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have
> +all its bits set, as it is by default), or domain creation will fail
> +with an `ERROR_INVAL`.
>
> Starting from Xen 4.3, in case automatic placement happens (and is
> successful), it will affect the domain's node affinity and _not_ its
> -CPU affinity. Namely, the domain's vCPUs will not be pinned to any
> +vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
> pCPU on the host, but the memory from the domain will come from the
> selected node(s) and the NUMA aware scheduling (if the credit scheduler
> -is in use) will try to keep the domain there as much as possible.
> +is in use) will try to keep the domain's vCPUs there as much as possible.
>
> Besides than that, looking and/or tweaking the placement algorithm
> search "Automatic NUMA placement" in libxl\_internal.h.
>
> Note this may change in future versions of Xen/libxl.
>
> +## Xen < 4.4 ##
> +
> +The concept of per-vCPU node-affinity has been introduced for the first
> +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is
> +the same for the whole domain, that is to say the same for all the vCPUs
> +of the domain.
> +
> ## Xen < 4.3 ##
>
> As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 366d9b9..ae29945 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d)
>     for_each_cpu ( cpu, cpumask )
>         node_set(cpu_to_node(cpu), d->node_affinity);
>
> -    sched_set_node_affinity(d, &d->node_affinity);
> -
>     spin_unlock(&d->node_affinity_lock);
>
>     free_cpumask_var(online_affinity);
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index c53a36b..d228127 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -178,9 +178,6 @@ struct csched_dom {
>     struct list_head active_vcpu;
>     struct list_head active_sdom_elem;
>     struct domain *dom;
> -    /* cpumask translated from the domain's node-affinity.
> -     * Basically, the CPUs we prefer to be scheduled on. */
> -    cpumask_var_t node_affinity_cpumask;
>     uint16_t active_vcpu_count;
>     uint16_t weight;
>     uint16_t cap;
> @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc)
>     list_del_init(&svc->runq_elem);
> }
>
> -/*
> - * Translates node-affinity mask into a cpumask, so that we can use it during
> - * actual scheduling. That of course will contain all the cpus from all the
> - * set nodes in the original node-affinity mask.
> - *
> - * Note that any serialization needed to access mask safely is complete
> - * responsibility of the caller of this function/hook.
> - */
> -static void csched_set_node_affinity(
> -    const struct scheduler *ops,
> -    struct domain *d,
> -    nodemask_t *mask)
> -{
> -    struct csched_dom *sdom;
> -    int node;
> -
> -    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
> -    if ( unlikely(is_idle_domain(d)) )
> -        return;
> -
> -    sdom = CSCHED_DOM(d);
> -    cpumask_clear(sdom->node_affinity_cpumask);
> -    for_each_node_mask( node, *mask )
> -        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
> -                   &node_to_cpumask(node));
> -}
>
> #define for_each_csched_balance_step(step) \
>     for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
> @@ -294,12 +265,12 @@ static void csched_set_node_affinity(
>
> /*
>  * vcpu-affinity balancing is always necessary and must never be skipped.
> - * OTOH, if a domain's node-affinity is said to be automatically computed
> - * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely.
> + * OTOH, if the vcpu's numa-affinity is being automatically computed out of
> + * the vcpu's vcpu-affinity, or if it just spans all the nodes, we can
> + * safely avoid dealing with numa-affinity entirely.
>  *
> - * Node-affinity is also deemed meaningless in case it has empty
> - * intersection with mask, to cover the cases where using the node-affinity
> + * A vcpu's numa-affinity is also deemed meaningless in case it has empty
> + * intersection with mask, to cover the cases where using the numa-affinity
>  * mask seems legit, but would instead led to trying to schedule the vcpu
>  * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
>  * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
> @@ -308,11 +279,9 @@ static void csched_set_node_affinity(
> static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
>                                            const cpumask_t *mask)
> {
> -    const struct domain *d = vc->domain;
> -    const struct csched_dom *sdom = CSCHED_DOM(d);
> -
> -    if (cpumask_full(sdom->node_affinity_cpumask)
> -        || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> +    if ( vc->auto_node_affinity == 1
> +         || cpumask_full(vc->node_affinity)
> +         || !cpumask_intersects(vc->node_affinity, mask) )

Isn't cpumask_full(vc->node_affinity) <=> vc->auto_node_affinity at this point?

Other than that, I think this whole patch looks good -- nice when you can add a feature and simplify the code. :-)

 -George
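
P.S. Just to illustrate the kind of simplification I have in mind -- a
completely untested sketch, which assumes the equivalence above actually
holds (i.e., that auto_node_affinity is set exactly when v->node_affinity
ends up spanning all the cpus), and which guesses at the tail of the
function, since it is trimmed from the quote:

    static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
                                               const cpumask_t *mask)
    {
        /* An automatically computed numa-affinity just mirrors the
         * vcpu-affinity, and a numa-affinity that does not intersect
         * mask is useless: in both cases, skip the numa balancing step. */
        if ( vc->auto_node_affinity
             || !cpumask_intersects(vc->node_affinity, mask) )
            return 0;

        return 1;
    }

If there is a case where the two checks can actually disagree, though,
then of course the cpumask_full() one needs to stay.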