From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
Keir Fraser <keir@xen.org>,
Ian Campbell <Ian.Campbell@citrix.com>,
Li Yechen <lccycc123@gmail.com>,
Andrew Cooper <Andrew.Cooper3@citrix.com>,
Juergen Gross <juergen.gross@ts.fujitsu.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
xen-devel@lists.xen.org, Jan Beulich <JBeulich@suse.com>,
Justin Weaver <jtweaver@hawaii.edu>,
Daniel De Graaf <dgdegra@tycho.nsa.gov>,
Matt Wilson <msw@amazon.com>,
Elena Ufimtseva <ufimtseva@gmail.com>
Subject: Re: [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
Date: Tue, 5 Nov 2013 16:20:48 +0000 [thread overview]
Message-ID: <52791AE0.2000102@eu.citrix.com> (raw)
In-Reply-To: <20131105143519.30446.9826.stgit@Solace>
On 11/05/2013 02:35 PM, Dario Faggioli wrote:
> instead of relying on the domain-wide node-affinity. To achieve
> that, we use the specific vcpu's node-affinity when computing
> the NUMA load balancing mask in csched_balance_cpumask().
>
> As a side effect, this simplifies the code quite a bit. In
> fact, prior to this change, we needed to cache the translation
> of d->node_affinity (which is a nodemask_t) to a cpumask_t,
> since that is what is needed during the actual scheduling
> (we used to keep it in node_affinity_cpumask).
>
> Since the per-vcpu node-affinity is maintained in a cpumask_t
> field (v->node_affinity) already, we don't need that complicated
> updating logic in place any longer; hence, this change can
> remove stuff like sched_set_node_affinity(),
> csched_set_node_affinity() and, of course, node_affinity_cpumask
> from csched_dom.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is updated too, to match
> the new behavior. While at it, an attempt is made to make the
> document as unambiguous as possible with respect to the
> concepts of vCPU pinning, domain node-affinity and vCPU
> node-affinity.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> docs/misc/xl-numa-placement.markdown | 124 +++++++++++++++++++++++-----------
> xen/common/domain.c | 2 -
> xen/common/sched_credit.c | 62 ++---------------
> xen/common/schedule.c | 5 -
> xen/include/xen/sched-if.h | 2 -
> xen/include/xen/sched.h | 1
> 6 files changed, 94 insertions(+), 102 deletions(-)
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> index caa3fec..890b856 100644
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually
> defined as a set of processor cores (typically a physical CPU package) and
> the memory directly attached to the set of cores.
>
> -The Xen hypervisor deals with NUMA machines by assigning to each domain
> -a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated. Also, even if the node affinity of a domain
> -is allowed to change on-line, it is very important to "place" the domain
> -correctly when it is fist created, as the most of its memory is allocated
> -at that time and can not (for now) be moved easily.
> -
> NUMA awareness becomes very important as soon as many domains start
> running memory-intensive workloads on a shared host. In fact, the cost
> of accessing non node-local memory locations is very high, and the
> @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable.
> For more information, have a look at the [Xen NUMA Introduction][numa_intro]
> page on the Wiki.
>
> +## Xen and NUMA machines: the concept of _node-affinity_ ##
> +
> +The Xen hypervisor deals with NUMA machines through the concept of
> +_node-affinity_. When talking about node-affinity, it is important to
> +distinguish two different situations. The node-affinity *of a domain* is
> +the set of NUMA nodes of the host where the memory for the domain is
> +being allocated (mostly, at domain creation time). The node-affinity *of
> +a vCPU* is the set of NUMA nodes of the host on which the vCPU prefers
> +to run, and the (credit1 only, for now) scheduler will try to accomplish
> +that whenever it is possible.
> +
> +Of course, despite the fact that they belong to and affect different
> +subsystems, domain and vCPU node-affinities are related. In fact, the
> +node-affinity of a domain is the union of the node-affinities of all the
> +domain's vCPUs.
> +
> +The above means that, when changing the vCPU node-affinity, the domain
> +node-affinity also changes. Although this is allowed to happen
> +on-line (i.e., when a domain is already running), it will not result
> +in the memory that has already been allocated being moved to a different
> +host NUMA node. This is why it is very important to "place" the domain
> +correctly when it is first created, as most of its memory is allocated
> +at that time and cannot (for now) be moved easily.
> +
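
Mostly for my own understanding of the above: if I read the series right, "the
node-affinity of a domain is the union of the node-affinities of all the
domain's vCPUs" conceptually boils down to something like the sketch below
(ignoring locking and pinning; just an illustration, not necessarily the exact
code from the earlier patches):

    /* Sketch: re-derive d->node_affinity (a nodemask_t) as the union of
     * the NUMA nodes spanned by each vCPU's node-affinity cpumask. */
    struct vcpu *v;
    unsigned int cpu;

    nodes_clear(d->node_affinity);
    for_each_vcpu ( d, v )
        for_each_cpu ( cpu, v->node_affinity )
            node_set(cpu_to_node(cpu), d->node_affinity);
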
> ### Placing via pinning and cpupools ###
>
> The simplest way of placing a domain on a NUMA node is statically pinning
> -the domain's vCPUs to the pCPUs of the node. This goes under the name of
> -CPU affinity and can be set through the "cpus=" option in the config file
> -(more about this below). Another option is to pool together the pCPUs
> -spanning the node and put the domain in such a cpupool with the "pool="
> +the domain's vCPUs to the pCPUs of the node. This also goes under the name
> +of vCPU-affinity and can be set through the "cpus=" option in the config
> +file (more about this below). Another option is to pool together the pCPUs
> +spanning the node and put the domain in such a _cpupool_ with the "pool="
> config option (as documented in our [Wiki][cpupools_howto]).
>
> In both the above cases, the domain will not be able to execute outside
> @@ -45,24 +62,45 @@ may come at he cost of some load imbalances.
>
> ### NUMA aware scheduling ###
>
> -If the credit scheduler is in use, the concept of node affinity defined
> -above does not only apply to memory. In fact, starting from Xen 4.3, the
> -scheduler always tries to run the domain's vCPUs on one of the nodes in
> -its node affinity. Only if that turns out to be impossible, it will just
> -pick any free pCPU.
> +If the credit scheduler is in use, starting from Xen 4.3 it always tries
> +to run the domain's vCPUs on one of the nodes in their node-affinity.
> +Only if that turns out to be impossible will it pick any free pCPU.
> +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity,
> +potentially different from the ones of all the other vCPUs of the domain.
>
> -This is, therefore, something more flexible than CPU affinity, as a domain
> -can still run everywhere, it just prefers some nodes rather than others.
> +This is, therefore, something more flexible than vCPU pinning, as vCPUs
> +can still run everywhere; they just prefer some nodes rather than others.
> Locality of access is less guaranteed than in the pinning case, but that
> comes along with better chances to exploit all the host resources (e.g.,
> the pCPUs).
>
> -In fact, if all the pCPUs in a domain's node affinity are busy, it is
> -possible for the domain to run outside of there, but it is very likely that
> -slower execution (due to remote memory accesses) is still better than no
> -execution at all, as it would happen with pinning. For this reason, NUMA
> -aware scheduling has the potential of bringing substantial performances
> -benefits, although this will depend on the workload.
> +In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is
> +possible for the domain to run outside of there. The idea is that
> +slower execution (due to remote memory accesses) is still better than
> +no execution at all (as would happen with pinning). For this reason,
> +NUMA aware scheduling has the potential of bringing substantial
> +performance benefits, although this will depend on the workload.
> +
> +Notice that, for each vCPU, the following three scenarios are possible:
> +
> + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU
> + node-affinity. In this case, the vCPU is always scheduled on one
> + of the pCPUs to which it is pinned, with no specific preference
> + among them. Internally, the vCPU's node-affinity is just
> + automatically computed from the vCPU pinning, and the scheduler
> + just ignores it;
> + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to
> + any particular pCPU. In this case, the vCPU can run on every pCPU.
> + Nevertheless, the scheduler will try to have it running on one of
> + the pCPUs of the node(s) it has node-affinity with;
> + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to
> + some pCPUs. In this case, the vCPU is always scheduled on one of the
> + pCPUs to which it is pinned, with, among them, a preference for the
> + ones that are from the node(s) it has node-affinity with. In case
> + pinning and node-affinity form two disjoint sets of pCPUs, pinning
> + "wins", and the node-affinity, although it is still used to derive
> + the domain's node-affinity (for memory allocation), is, from the
> + scheduler's perspective, just ignored.
>
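
For anyone trying to map the three cases above onto the scheduler: the set of
candidate pCPUs effectively ends up being the intersection of the pinning mask
and the node-affinity mask when they overlap, and just the pinning mask when
they are disjoint. Roughly (an illustrative sketch, with made-up variable
names, not the actual credit1 helpers):

    cpumask_t candidates;

    if ( cpumask_intersects(v->cpu_affinity, v->node_affinity) )
        /* Prefer pCPUs satisfying both pinning and node-affinity... */
        cpumask_and(&candidates, v->cpu_affinity, v->node_affinity);
    else
        /* ...but if the two sets are disjoint, pinning wins. */
        cpumask_copy(&candidates, v->cpu_affinity);

with the scheduler falling back to the plain pinning mask anyway if no pCPU in
the preferred set is available (which is what the two balancing steps in
sched_credit.c implement).
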
> ## Guest placement in xl ##
>
> @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes.
>
> Note that xm/xend does a very similar thing, the only differences being
> the details of the heuristics adopted for automatic placement (see below),
> -and the lack of support (in both xm/xend and the Xen versions where that\
> +and the lack of support (in both xm/xend and the Xen versions where that
> was the default toolstack) for NUMA aware scheduling.
>
> ### Placing the guest manually ###
>
> Thanks to the "cpus=" option, it is possible to specify where a domain
> should be created and scheduled on, directly in its config file. This
> -affects NUMA placement and memory accesses as the hypervisor constructs
> -the node affinity of a VM basing right on its CPU affinity when it is
> -created.
> +affects NUMA placement and memory accesses as, in this case, the hypervisor
> +constructs the node-affinity of a VM based on its vCPU pinning
> +when it is created.
>
> This is very simple and effective, but requires the user/system
> -administrator to explicitly specify affinities for each and every domain,
> +administrator to explicitly specify the pinning of each and every domain,
> or Xen won't be able to guarantee the locality for their memory accesses.
>
> -Notice that this also pins the domain's vCPUs to the specified set of
> -pCPUs, so it not only sets the domain's node affinity (its memory will
> -come from the nodes to which the pCPUs belong), but at the same time
> -forces the vCPUs of the domain to be scheduled on those same pCPUs.
> +That, of course, also means the vCPUs of the domain will only be able to
> +execute on those same pCPUs.
>
> ### Placing the guest automatically ###
>
> @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries
> to figure out on its own on which node(s) the domain could fit best.
> If it finds one (some), the domain's node affinity get set to there,
> and both memory allocations and NUMA aware scheduling (for the credit
> -scheduler and starting from Xen 4.3) will comply with it.
> +scheduler and starting from Xen 4.3) will comply with it. Starting from
> +Xen 4.4, this just means all the vCPUs of the domain will have the same
> +vCPU node-affinity, which is the outcome of this automatic "fitting"
> +procedure.
>
> It is worthwhile noting that optimally fitting a set of VMs on the NUMA
> nodes of an host is an incarnation of the Bin Packing Problem. In fact,
> @@ -143,22 +182,29 @@ any placement from happening:
> libxl_defbool_set(&domain_build_info->numa_placement, false);
>
> Also, if `numa_placement` is set to `true`, the domain must not
> -have any CPU affinity (i.e., `domain_build_info->cpumap` must
> -have all its bits set, as it is by default), or domain creation
> -will fail returning `ERROR_INVAL`.
> +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have
> +all its bits set, as it is by default), or domain creation will fail
> +with an `ERROR_INVAL`.
>
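
Side note for API consumers: the above means that, from C, combining manual
pinning with libxl requires explicitly disabling automatic placement. A
minimal, untested sketch (assuming the usual libxl bitmap helpers, an
initialised ctx, and a domain config in d_config):

    libxl_domain_build_info *b_info = &d_config.b_info;
    int i;

    /* Turn automatic placement off and pin to pCPUs 0-3, so that
     * domain creation does not fail with ERROR_INVAL. */
    libxl_defbool_set(&b_info->numa_placement, false);
    libxl_cpu_bitmap_alloc(ctx, &b_info->cpumap, 0);
    libxl_bitmap_set_none(&b_info->cpumap);
    for ( i = 0; i < 4; i++ )
        libxl_bitmap_set(&b_info->cpumap, i);
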
> Starting from Xen 4.3, in case automatic placement happens (and is
> successful), it will affect the domain's node affinity and _not_ its
> -CPU affinity. Namely, the domain's vCPUs will not be pinned to any
> +vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
> pCPU on the host, but the memory from the domain will come from the
> selected node(s) and the NUMA aware scheduling (if the credit scheduler
> -is in use) will try to keep the domain there as much as possible.
> +is in use) will try to keep the domain's vCPUs there as much as possible.
>
> Besides than that, looking and/or tweaking the placement algorithm
> search "Automatic NUMA placement" in libxl\_internal.h.
>
> Note this may change in future versions of Xen/libxl.
>
> +## Xen < 4.4 ##
> +
> +The concept of per-vCPU node-affinity was introduced for the first
> +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is
> +the same for the whole domain, that is to say the same for all the vCPUs
> +of the domain.
> +
> ## Xen < 4.3 ##
>
> As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 366d9b9..ae29945 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d)
> for_each_cpu ( cpu, cpumask )
> node_set(cpu_to_node(cpu), d->node_affinity);
>
> - sched_set_node_affinity(d, &d->node_affinity);
> -
> spin_unlock(&d->node_affinity_lock);
>
> free_cpumask_var(online_affinity);
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index c53a36b..d228127 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -178,9 +178,6 @@ struct csched_dom {
> struct list_head active_vcpu;
> struct list_head active_sdom_elem;
> struct domain *dom;
> - /* cpumask translated from the domain's node-affinity.
> - * Basically, the CPUs we prefer to be scheduled on. */
> - cpumask_var_t node_affinity_cpumask;
> uint16_t active_vcpu_count;
> uint16_t weight;
> uint16_t cap;
> @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc)
> list_del_init(&svc->runq_elem);
> }
>
> -/*
> - * Translates node-affinity mask into a cpumask, so that we can use it during
> - * actual scheduling. That of course will contain all the cpus from all the
> - * set nodes in the original node-affinity mask.
> - *
> - * Note that any serialization needed to access mask safely is complete
> - * responsibility of the caller of this function/hook.
> - */
> -static void csched_set_node_affinity(
> - const struct scheduler *ops,
> - struct domain *d,
> - nodemask_t *mask)
> -{
> - struct csched_dom *sdom;
> - int node;
> -
> - /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
> - if ( unlikely(is_idle_domain(d)) )
> - return;
> -
> - sdom = CSCHED_DOM(d);
> - cpumask_clear(sdom->node_affinity_cpumask);
> - for_each_node_mask( node, *mask )
> - cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
> - &node_to_cpumask(node));
> -}
>
> #define for_each_csched_balance_step(step) \
> for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
> @@ -294,12 +265,12 @@ static void csched_set_node_affinity(
>
> /*
> * vcpu-affinity balancing is always necessary and must never be skipped.
> - * OTOH, if a domain's node-affinity is said to be automatically computed
> - * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely.
> + * OTOH, if the vcpu's numa-affinity is being automatically computed out of
> + * the vcpu's vcpu-affinity, or if it just spans all the nodes, we can
> + * safely avoid dealing with numa-affinity entirely.
> *
> - * Node-affinity is also deemed meaningless in case it has empty
> - * intersection with mask, to cover the cases where using the node-affinity
> + * A vcpu's numa-affinity is also deemed meaningless in case it has empty
> + * intersection with mask, to cover the cases where using the numa-affinity
> * mask seems legit, but would instead led to trying to schedule the vcpu
> * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
> * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
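
(As an aside for other reviewers: with this patch the NUMA balancing step works
directly on the vcpu, so csched_balance_cpumask() conceptually becomes
something like the sketch below -- simplified, not the literal code:)

    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
        /* Restrict to pCPUs that are both in the vcpu's pinning mask
         * and in its numa-affinity. */
        cpumask_and(mask, vc->node_affinity, vc->cpu_affinity);
    else /* CSCHED_BALANCE_CPU_AFFINITY */
        cpumask_copy(mask, vc->cpu_affinity);
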
> @@ -308,11 +279,9 @@ static void csched_set_node_affinity(
> static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
> const cpumask_t *mask)
> {
> - const struct domain *d = vc->domain;
> - const struct csched_dom *sdom = CSCHED_DOM(d);
> -
> - if (cpumask_full(sdom->node_affinity_cpumask)
> - || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> + if ( vc->auto_node_affinity == 1
> + || cpumask_full(vc->node_affinity)
> + || !cpumask_intersects(vc->node_affinity, mask) )
Isn't cpumask_full(vc->node_affinity) <=> vc->auto_node_affinity at this
point?
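If so, I'd have expected the condition to be reducible to just (sketch):

    if ( vc->auto_node_affinity
         || !cpumask_intersects(vc->node_affinity, mask) )

but maybe I'm missing a case.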
Other than that, I think this whole patch looks good -- nice when you
can add a feature and simplify the code. :-)
-George
Thread overview: 71+ messages
2013-11-05 14:33 [PATCH RESEND 00/12] Implement per-vcpu NUMA node-affinity for credit1 Dario Faggioli
2013-11-05 14:34 ` [PATCH RESEND 01/12] xen: numa-sched: leave node-affinity alone if not in "auto" mode Dario Faggioli
2013-11-05 14:43 ` George Dunlap
2013-11-05 14:34 ` [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning Dario Faggioli
2013-11-05 14:50 ` George Dunlap
2013-11-06 8:48 ` Dario Faggioli
2013-11-07 18:17 ` Ian Jackson
2013-11-08 9:24 ` Dario Faggioli
2013-11-08 15:20 ` Ian Jackson
2013-11-05 14:34 ` [PATCH RESEND 03/12] xl: implement and enable dryrun mode for `xl vcpu-pin' Dario Faggioli
2013-11-05 14:34 ` [PATCH RESEND 04/12] xl: test script for the cpumap parser (for vCPU pinning) Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity Dario Faggioli
2013-11-05 14:52 ` Jan Beulich
2013-11-05 15:03 ` George Dunlap
2013-11-05 15:11 ` Jan Beulich
2013-11-05 15:24 ` George Dunlap
2013-11-05 22:15 ` Dario Faggioli
2013-11-05 15:11 ` George Dunlap
2013-11-05 15:23 ` Jan Beulich
2013-11-05 15:39 ` George Dunlap
2013-11-05 16:56 ` George Dunlap
2013-11-05 17:16 ` George Dunlap
2013-11-05 17:30 ` Jan Beulich
2013-11-05 23:12 ` Dario Faggioli
2013-11-05 23:01 ` Dario Faggioli
2013-11-06 9:39 ` Dario Faggioli
2013-11-06 9:46 ` Jan Beulich
2013-11-06 10:00 ` Dario Faggioli
2013-11-06 11:44 ` George Dunlap
2013-11-06 14:26 ` Dario Faggioli
2013-11-06 14:56 ` George Dunlap
2013-11-06 15:14 ` Jan Beulich
2013-11-06 16:12 ` George Dunlap
2013-11-06 16:22 ` Jan Beulich
2013-11-06 16:48 ` Dario Faggioli
2013-11-06 16:20 ` Dario Faggioli
2013-11-06 16:23 ` Dario Faggioli
2013-11-05 17:24 ` Jan Beulich
2013-11-05 17:31 ` George Dunlap
2013-11-05 23:08 ` Dario Faggioli
2013-11-05 22:54 ` Dario Faggioli
2013-11-05 22:22 ` Dario Faggioli
2013-11-06 11:41 ` Dario Faggioli
2013-11-06 14:47 ` George Dunlap
2013-11-06 16:53 ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 06/12] xen: numa-sched: domain node-affinity always comes from vcpu node-affinity Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling Dario Faggioli
2013-11-05 16:20 ` George Dunlap [this message]
2013-11-06 9:15 ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 08/12] xen: numa-sched: enable getting/specifying per-vcpu node-affinity Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 09/12] libxc: " Dario Faggioli
2013-11-07 18:27 ` Ian Jackson
2013-11-12 16:01 ` Konrad Rzeszutek Wilk
2013-11-12 16:43 ` George Dunlap
2013-11-12 16:55 ` Konrad Rzeszutek Wilk
2013-11-12 18:40 ` Dario Faggioli
2013-11-12 19:13 ` Konrad Rzeszutek Wilk
2013-11-12 21:36 ` Dario Faggioli
2013-11-13 10:57 ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 10/12] libxl: " Dario Faggioli
2013-11-07 18:29 ` Ian Jackson
2013-11-08 9:18 ` Dario Faggioli
2013-11-08 15:07 ` Ian Jackson
2013-11-05 14:36 ` [PATCH RESEND 11/12] xl: " Dario Faggioli
2013-11-07 18:33 ` Ian Jackson
2013-11-08 9:33 ` Dario Faggioli
2013-11-08 15:18 ` Ian Jackson
2013-11-05 14:36 ` [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file Dario Faggioli
2013-11-07 18:35 ` Ian Jackson
2013-11-08 9:49 ` Dario Faggioli
2013-11-08 15:22 ` Ian Jackson