From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
Date: Tue, 5 Nov 2013 16:20:48 +0000
Message-ID: <52791AE0.2000102@eu.citrix.com>
References: <20131105142844.30446.78671.stgit@Solace> <20131105143519.30446.9826.stgit@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20131105143519.30446.9826.stgit@Solace>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: Marcus Granado, Keir Fraser, Ian Campbell, Li Yechen, Andrew Cooper,
 Juergen Gross, Ian Jackson, xen-devel@lists.xen.org, Jan Beulich,
 Justin Weaver, Daniel De Graaf, Matt Wilson, Elena Ufimtseva
List-Id: xen-devel@lists.xenproject.org

On 11/05/2013 02:35 PM, Dario Faggioli wrote:
> instead of relying on the domain-wide node-affinity. To achieve
> that, we use the specific vcpu's node-affinity when computing
> the NUMA load balancing mask in csched_balance_cpumask().
>
> As a side effect, this simplifies the code quite a bit. In
> fact, prior to this change, we needed to cache the translation
> of d->node_affinity (which is a nodemask_t) to a cpumask_t,
> since that is what is needed during the actual scheduling
> (we used to keep it in node_affinity_cpumask).
>
> Since the per-vcpu node-affinity is maintained in a cpumask_t
> field (v->node_affinity) already, we don't need that complicated
> updating logic in place any longer, and this change hence can
> remove stuff like sched_set_node_affinity(),
> csched_set_node_affinity() and, of course, node_affinity_cpumask
> from csched_dom.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is updated too, to match
> the new behavior. While at it, an attempt is made to make the
> document as unambiguous as possible, with respect to the
> concepts of vCPU pinning, domain node-affinity and vCPU
> node-affinity.
>
> Signed-off-by: Dario Faggioli
> ---
>  docs/misc/xl-numa-placement.markdown |  124 +++++++++++++++++++++++-----------
>  xen/common/domain.c                  |    2 -
>  xen/common/sched_credit.c            |   62 ++---
>  xen/common/schedule.c                |    5 -
>  xen/include/xen/sched-if.h           |    2 -
>  xen/include/xen/sched.h              |    1
>  6 files changed, 94 insertions(+), 102 deletions(-)
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> index caa3fec..890b856 100644
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually
> defined as a set of processor cores (typically a physical CPU package) and
> the memory directly attached to the set of cores.
>
> -The Xen hypervisor deals with NUMA machines by assigning to each domain
> -a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated. Also, even if the node affinity of a domain
> -is allowed to change on-line, it is very important to "place" the domain
> -correctly when it is fist created, as the most of its memory is allocated
> -at that time and can not (for now) be moved easily.
> -
> NUMA awareness becomes very important as soon as many domains start
> running memory-intensive workloads on a shared host. In fact, the cost
> of accessing non node-local memory locations is very high, and the
> @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable.
>
> For more information, have a look at the [Xen NUMA Introduction][numa_intro]
> page on the Wiki.
>
> +## Xen and NUMA machines: the concept of _node-affinity_ ##
> +
> +The Xen hypervisor deals with NUMA machines through the concept of
> +_node-affinity_. When talking about node-affinity, it is important to
> +distinguish two different situations. The node-affinity *of a domain* is
> +the set of NUMA nodes of the host where the memory for the domain is
> +being allocated (mostly, at domain creation time). The node-affinity *of
> +a vCPU* is the set of NUMA nodes of the host on which the vCPU prefers
> +to run, and the (credit1 only, for now) scheduler will try to accomplish
> +that, whenever it is possible.
> +
> +Of course, despite the fact that they belong to and affect different
> +subsystems, domain and vCPU node-affinities are related. In fact, the
> +node-affinity of a domain is the union of the node-affinities of all the
> +domain's vCPUs.
> +
> +The above means that, when changing the vCPU node-affinity, the domain
> +node-affinity also changes. Well, although this is allowed to happen
> +on-line (i.e., when a domain is already running), that will not result
> +in the memory that has already been allocated being moved to a different
> +host NUMA node. This is why it is very important to "place" the domain
> +correctly when it is first created, as most of its memory is allocated
> +at that time and can not (for now) be moved easily.
> +
> ### Placing via pinning and cpupools ###
>
> The simplest way of placing a domain on a NUMA node is statically pinning
> -the domain's vCPUs to the pCPUs of the node. This goes under the name of
> -CPU affinity and can be set through the "cpus=" option in the config file
> -(more about this below). Another option is to pool together the pCPUs
> -spanning the node and put the domain in such a cpupool with the "pool="
> +the domain's vCPUs to the pCPUs of the node. This also goes under the name
> +of vCPU-affinity and can be set through the "cpus=" option in the config
> +file (more about this below). Another option is to pool together the pCPUs
> +spanning the node and put the domain in such a _cpupool_ with the "pool="
> config option (as documented in our [Wiki][cpupools_howto]).
>
> In both the above cases, the domain will not be able to execute outside
> @@ -45,24 +62,45 @@ may come at he cost of some load imbalances.
>
> ### NUMA aware scheduling ###
>
> -If the credit scheduler is in use, the concept of node affinity defined
> -above does not only apply to memory. In fact, starting from Xen 4.3, the
> -scheduler always tries to run the domain's vCPUs on one of the nodes in
> -its node affinity. Only if that turns out to be impossible, it will just
> -pick any free pCPU.
> +If using the credit scheduler, and starting from Xen 4.3, the scheduler always
> +tries to run the domain's vCPUs on one of the nodes in their node-affinity.
> +Only if that turns out to be impossible will it just pick any free pCPU.
> +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity,
> +potentially different from the ones of all the other vCPUs of the domain.
>
> -This is, therefore, something more flexible than CPU affinity, as a domain
> -can still run everywhere, it just prefers some nodes rather than others.
> +This is, therefore, something more flexible than vCPU pinning, as vCPUs
> +can still run everywhere, they just prefer some nodes rather than others.
> Locality of access is less guaranteed than in the pinning case, but that
> comes along with better chances to exploit all the host resources (e.g.,
> the pCPUs).
>
> -In fact, if all the pCPUs in a domain's node affinity are busy, it is
> -possible for the domain to run outside of there, but it is very likely that
> -slower execution (due to remote memory accesses) is still better than no
> -execution at all, as it would happen with pinning. For this reason, NUMA
> -aware scheduling has the potential of bringing substantial performances
> -benefits, although this will depend on the workload.
> +In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is
> +possible for the domain to run outside of there. The idea is that
> +slower execution (due to remote memory accesses) is still better than
> +no execution at all (as it would happen with pinning). For this reason,
> +NUMA aware scheduling has the potential of bringing substantial
> +performance benefits, although this will depend on the workload.
> +
> +Notice that, for each vCPU, the following three scenarios are possible:
> +
> + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU
> +   node-affinity. In this case, the vCPU is always scheduled on one
> +   of the pCPUs to which it is pinned, without any specific preference
> +   as to which one of them. Internally, the vCPU's node-affinity is just
> +   automatically computed from the vCPU pinning, and the scheduler
> +   just ignores it;
> + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to
> +   any particular pCPU. In this case, the vCPU can run on every pCPU.
> +   Nevertheless, the scheduler will try to have it running on one of
> +   the pCPUs of the node(s) with which it has node-affinity;
> + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to
> +   some pCPUs. In this case, the vCPU is always scheduled on one of the
> +   pCPUs to which it is pinned, with, among them, a preference for the
> +   ones that are from the node(s) it has node-affinity with. In case
> +   pinning and node-affinity form two disjoint sets of pCPUs, pinning
> +   "wins", and the node-affinity, although it is still used to derive
> +   the domain's node-affinity (for memory allocation), is, from the
> +   scheduler's perspective, just ignored.
>
> ## Guest placement in xl ##
>
> @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes.
>
> Note that xm/xend does a very similar thing, the only differences being
> the details of the heuristics adopted for automatic placement (see below),
> -and the lack of support (in both xm/xend and the Xen versions where that\
> +and the lack of support (in both xm/xend and the Xen versions where that
> was the default toolstack) for NUMA aware scheduling.
>
> ### Placing the guest manually ###
>
> Thanks to the "cpus=" option, it is possible to specify where a domain
> should be created and scheduled on, directly in its config file. This
> -affects NUMA placement and memory accesses as the hypervisor constructs
> -the node affinity of a VM basing right on its CPU affinity when it is
> -created.
> +affects NUMA placement and memory accesses as, in this case, the hypervisor
> +constructs the node-affinity of a VM based on its vCPU pinning
> +when it is created.
>
> This is very simple and effective, but requires the user/system
> -administrator to explicitly specify affinities for each and every domain,
> +administrator to explicitly specify the pinning of each and every domain,
> or Xen won't be able to guarantee the locality for their memory accesses.
>
> -Notice that this also pins the domain's vCPUs to the specified set of
> -pCPUs, so it not only sets the domain's node affinity (its memory will
> -come from the nodes to which the pCPUs belong), but at the same time
> -forces the vCPUs of the domain to be scheduled on those same pCPUs.
> +That, of course, also means the vCPUs of the domain will only be able to
> +execute on those same pCPUs.
>
> ### Placing the guest automatically ###
>
> @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries
> to figure out on its own on which node(s) the domain could fit best.
> If it finds one (some), the domain's node affinity get set to there,
> and both memory allocations and NUMA aware scheduling (for the credit
> -scheduler and starting from Xen 4.3) will comply with it.
> +scheduler and starting from Xen 4.3) will comply with it. Starting from
> +Xen 4.4, this just means all the vCPUs of the domain will have the same
> +vCPU node-affinity, which is the outcome of such automatic "fitting"
> +procedure.
>
> It is worthwhile noting that optimally fitting a set of VMs on the NUMA
> nodes of an host is an incarnation of the Bin Packing Problem. In fact,
> @@ -143,22 +182,29 @@ any placement from happening:
>
>     libxl_defbool_set(&domain_build_info->numa_placement, false);
>
> Also, if `numa_placement` is set to `true`, the domain must not
> -have any CPU affinity (i.e., `domain_build_info->cpumap` must
> -have all its bits set, as it is by default), or domain creation
> -will fail returning `ERROR_INVAL`.
> +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have
> +all its bits set, as it is by default), or domain creation will fail
> +with an `ERROR_INVAL`.
>
> Starting from Xen 4.3, in case automatic placement happens (and is
> successful), it will affect the domain's node affinity and _not_ its
> -CPU affinity. Namely, the domain's vCPUs will not be pinned to any
> +vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
> pCPU on the host, but the memory from the domain will come from the
> selected node(s) and the NUMA aware scheduling (if the credit scheduler
> -is in use) will try to keep the domain there as much as possible.
> +is in use) will try to keep the domain's vCPUs there as much as possible.
>
> Besides than that, looking and/or tweaking the placement algorithm
> search "Automatic NUMA placement" in libxl\_internal.h.
>
> Note this may change in future versions of Xen/libxl.
>
> +## Xen < 4.4 ##
> +
> +The concept of per-vCPU node-affinity has been introduced for the first
> +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is
> +the same for the whole domain, that is to say the same for all the vCPUs
> +of the domain.
> +
> ## Xen < 4.3 ##
>
> As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 366d9b9..ae29945 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d)
>     for_each_cpu ( cpu, cpumask )
>         node_set(cpu_to_node(cpu), d->node_affinity);
>
> -    sched_set_node_affinity(d, &d->node_affinity);
> -
>     spin_unlock(&d->node_affinity_lock);
>
>     free_cpumask_var(online_affinity);
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index c53a36b..d228127 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -178,9 +178,6 @@ struct csched_dom {
>     struct list_head active_vcpu;
>     struct list_head active_sdom_elem;
>     struct domain *dom;
> -    /* cpumask translated from the domain's node-affinity.
> -     * Basically, the CPUs we prefer to be scheduled on. */
> -    cpumask_var_t node_affinity_cpumask;
>     uint16_t active_vcpu_count;
>     uint16_t weight;
>     uint16_t cap;
> @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc)
>     list_del_init(&svc->runq_elem);
> }
>
> -/*
> - * Translates node-affinity mask into a cpumask, so that we can use it during
> - * actual scheduling. That of course will contain all the cpus from all the
> - * set nodes in the original node-affinity mask.
> - *
> - * Note that any serialization needed to access mask safely is complete
> - * responsibility of the caller of this function/hook.
> - */
> -static void csched_set_node_affinity(
> -    const struct scheduler *ops,
> -    struct domain *d,
> -    nodemask_t *mask)
> -{
> -    struct csched_dom *sdom;
> -    int node;
> -
> -    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
> -    if ( unlikely(is_idle_domain(d)) )
> -        return;
> -
> -    sdom = CSCHED_DOM(d);
> -    cpumask_clear(sdom->node_affinity_cpumask);
> -    for_each_node_mask( node, *mask )
> -        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
> -                   &node_to_cpumask(node));
> -}
>
> #define for_each_csched_balance_step(step) \
>     for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
> @@ -294,12 +265,12 @@ static void csched_set_node_affinity(
>
> /*
>  * vcpu-affinity balancing is always necessary and must never be skipped.
> - * OTOH, if a domain's node-affinity is said to be automatically computed
> - * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely.
> + * OTOH, if the vcpu's numa-affinity is being automatically computed out of
> + * the vcpu's vcpu-affinity, or if it just spans all the nodes, we can
> + * safely avoid dealing with numa-affinity entirely.
>  *
> - * Node-affinity is also deemed meaningless in case it has empty
> - * intersection with mask, to cover the cases where using the node-affinity
> + * A vcpu's numa-affinity is also deemed meaningless in case it has empty
> + * intersection with mask, to cover the cases where using the numa-affinity
>  * mask seems legit, but would instead led to trying to schedule the vcpu
>  * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
>  * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
> @@ -308,11 +279,9 @@ static void csched_set_node_affinity(
> static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
>                                            const cpumask_t *mask)
> {
> -    const struct domain *d = vc->domain;
> -    const struct csched_dom *sdom = CSCHED_DOM(d);
> -
> -    if (cpumask_full(sdom->node_affinity_cpumask)
> -        || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> +    if ( vc->auto_node_affinity == 1
> +         || cpumask_full(vc->node_affinity)
> +         || !cpumask_intersects(vc->node_affinity, mask) )

Isn't cpumask_full(vc->node_affinity) <=> vc->auto_node_affinity at this point?

Other than that, I think this whole patch looks good -- nice when you can add a feature and simplify the code. :-)

 -George
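
P.S. Just to illustrate the kind of simplification I have in mind -- a
completely untested sketch, which assumes the equivalence above actually
holds (i.e., that auto_node_affinity is set exactly when v->node_affinity
ends up spanning all the cpus), and which guesses at the tail of the
function, since it is trimmed from the quote:

    static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
                                               const cpumask_t *mask)
    {
        /* An automatically computed numa-affinity just mirrors the
         * vcpu-affinity, and a numa-affinity that does not intersect
         * mask is useless: in both cases, skip the numa balancing step. */
        if ( vc->auto_node_affinity
             || !cpumask_intersects(vc->node_affinity, mask) )
            return 0;

        return 1;
    }

If there is a case where the two checks can actually disagree, though,
then of course the cpumask_full() one needs to stay.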