From: George Dunlap
Subject: Re: [PATCH v6 2/9] xen: sched: introduce soft-affinity and use it instead d->node-affinity
Date: Mon, 2 Jun 2014 15:40:53 +0100
Message-ID: <538C8CF5.8060107@eu.citrix.com>
References: <1401237770-7003-1-git-send-email-dario.faggioli@citrix.com> <1401237770-7003-3-git-send-email-dario.faggioli@citrix.com>
In-Reply-To: <1401237770-7003-3-git-send-email-dario.faggioli@citrix.com>
To: Dario Faggioli, xen-devel
Cc: Andrew Cooper, Keir Fraser, Ian Jackson, Ian Campbell, Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On 05/28/2014 01:42 AM, Dario Faggioli wrote:
> Before this change, each vcpu had its own vcpu-affinity
> (in v->cpu_affinity), representing the set of pcpus where
> the vcpu is allowed to run. Since NUMA-aware scheduling was
> introduced, the (credit1 only, for now) scheduler also
> tries as much as it can to run all the vcpus of a domain
> on one of the nodes that constitute the domain's
> node-affinity.
>
> The idea here is making the mechanism more general by:
> * allowing for this 'preference' for some pcpus/nodes to be
>   expressed on a per-vcpu basis, rather than for the domain
>   as a whole. That is to say, each vcpu should have its own
>   set of preferred pcpus/nodes, rather than it being the
>   very same for all the vcpus of the domain;
> * generalizing the idea of 'preferred pcpus' beyond NUMA
>   awareness and support. That is to say, independently of
>   whether it is (mostly) useful on NUMA systems or not, it
>   should be possible to specify, for each vcpu, a set of pcpus
>   where it prefers to run (in addition to, and possibly
>   unrelated to, the set of pcpus where it is allowed to run).
>
> We will be calling this set of *preferred* pcpus the vcpu's
> soft affinity. This change introduces it, and starts using it
> for scheduling, replacing the indirect use of the domain's NUMA
> node-affinity. This is more general, as soft affinity does not
> have to be related to NUMA. Nevertheless, it allows achieving the
> same results as NUMA-aware scheduling, just by making soft affinity
> equal to the domain's node affinity, for all the vCPUs (e.g.,
> from the toolstack).
>
> This also means renaming most of the NUMA-aware scheduling related
> functions, in credit1, to something more generic, hinting at
> the concept of soft affinity rather than directly at NUMA awareness.
>
> As a side effect, this simplifies the code quite a bit. In fact,
> prior to this change, we needed to cache the translation of
> d->node_affinity (which is a nodemask_t) to a cpumask_t, since that
> is what scheduling decisions require (we used to keep it in
> node_affinity_cpumask). This, and all the complicated logic
> required to keep it updated, is no longer necessary.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is being updated too, to match
> the new architecture.
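
(Aside for anyone following the thread who has not read the credit1
changes yet: the mechanism described above boils down to a two-step
pCPU selection, roughly like the self-contained sketch below. Plain
64-bit masks stand in for cpumask_t and the function name is made up;
this is an illustration, not the actual credit1 code.)

#include <stdint.h>
#include <stdio.h>

/*
 * Pick a pCPU for a vCPU: prefer an idle pCPU in the intersection of
 * hard and soft affinity, and fall back to any idle pCPU in the hard
 * affinity alone.  Returns -1 if no idle pCPU is usable at all.
 */
static int pick_pcpu(uint64_t hard, uint64_t soft, uint64_t idle)
{
    uint64_t soft_step = hard & soft & idle;  /* soft-affinity step */
    uint64_t hard_step = hard & idle;         /* hard-affinity fallback */
    uint64_t candidates = soft_step ? soft_step : hard_step;

    if ( !candidates )
        return -1;
    return __builtin_ctzll(candidates);       /* lowest set bit = pCPU id */
}

int main(void)
{
    /*
     * A vCPU allowed on pCPUs 0-7 and preferring pCPUs 4-7 (say, the
     * pCPUs of its NUMA node), with only pCPUs 1 and 2 idle: the soft
     * step finds nothing, so the hard step picks pCPU 1.
     */
    printf("picked pCPU %d\n", pick_pcpu(0xffULL, 0xf0ULL, 0x06ULL));
    return 0;
}

The point is that the hard mask is always the constraint, while the
soft mask only ever narrows the search within it.
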
>
> Signed-off-by: Dario Faggioli
> Reviewed-by: George Dunlap
> ---
> Changes from v2:
>  * this patch folds patches 6 ("xen: sched: make space for
>    cpu_soft_affinity") and 10 ("xen: sched: use soft-affinity
>    instead of domain's node-affinity"), as suggested during
>    review. 'Reviewed-by' from George is there since both patch
>    6 and 10 had it, and I didn't do anything other than squashing
>    them.
>
> Changes from v1:
>  * in v1, "7/12 xen: numa-sched: use per-vcpu node-affinity for
>    actual scheduling" was doing something very similar to this
>    patch.
> ---
>  docs/misc/xl-numa-placement.markdown | 148 ++++++++++++++++++++------------
>  xen/common/domain.c                  |   5 +-
>  xen/common/keyhandler.c              |   2 +
>  xen/common/sched_credit.c            | 153 +++++++++++++---------------------
>  xen/common/schedule.c                |   3 +
>  xen/include/xen/sched.h              |   3 +
>  6 files changed, 168 insertions(+), 146 deletions(-)
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> index caa3fec..b1ed361 100644
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually
>  defined as a set of processor cores (typically a physical CPU package) and
>  the memory directly attached to the set of cores.
>
> -The Xen hypervisor deals with NUMA machines by assigning to each domain
> -a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated. Also, even if the node affinity of a domain
> -is allowed to change on-line, it is very important to "place" the domain
> -correctly when it is fist created, as the most of its memory is allocated
> -at that time and can not (for now) be moved easily.
> -
>  NUMA awareness becomes very important as soon as many domains start
>  running memory-intensive workloads on a shared host. In fact, the cost
>  of accessing non node-local memory locations is very high, and the
> @@ -27,14 +20,37 @@ performance degradation is likely to be noticeable.
>  For more information, have a look at the [Xen NUMA Introduction][numa_intro]
>  page on the Wiki.
>
> +## Xen and NUMA machines: the concept of _node-affinity_ ##
> +
> +The Xen hypervisor deals with NUMA machines through the concept of
> +_node-affinity_. The node-affinity of a domain is the set of NUMA nodes
> +of the host where the memory for the domain is being allocated (mostly,
> +at domain creation time). This is, at least in principle, different from
> +and unrelated to the vCPU (hard and soft, see below) scheduling affinity,
> +which instead is the set of pCPUs where the vCPU is allowed (or prefers)
> +to run.
> +
> +Of course, despite the fact that they belong to and affect different
> +subsystems, the domain node-affinity and the vCPUs' affinity are not
> +completely independent.
> +In fact, if the domain node-affinity is not explicitly specified by the
> +user, via the proper libxl calls or xl config item, it will be computed
> +based on the vCPUs' scheduling affinity.
> +
> +Notice that, even if the node affinity of a domain may change on-line,
> +it is very important to "place" the domain correctly when it is first
> +created, as most of its memory is allocated at that time and cannot
> +(for now) be moved easily.
> +
>  ### Placing via pinning and cpupools ###
>
> -The simplest way of placing a domain on a NUMA node is statically pinning
> -the domain's vCPUs to the pCPUs of the node. This goes under the name of
> -CPU affinity and can be set through the "cpus=" option in the config file
> -(more about this below). Another option is to pool together the pCPUs
> -spanning the node and put the domain in such a cpupool with the "pool="
> -config option (as documented in our [Wiki][cpupools_howto]).
> +The simplest way of placing a domain on a NUMA node is setting the hard
> +scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
> +also goes under the name of vCPU pinning, and can be done through the
> +"cpus=" option in the config file (more about this below). Another option
> +is to pool together the pCPUs spanning the node and put the domain in
> +such a _cpupool_ with the "pool=" config option (as documented in our
> +[Wiki][cpupools_howto]).
>
>  In both the above cases, the domain will not be able to execute outside
>  the specified set of pCPUs for any reasons, even if all those pCPUs are
> @@ -45,24 +61,45 @@ may come at he cost of some load imbalances.
>
>  ### NUMA aware scheduling ###
>
> -If the credit scheduler is in use, the concept of node affinity defined
> -above does not only apply to memory. In fact, starting from Xen 4.3, the
> -scheduler always tries to run the domain's vCPUs on one of the nodes in
> -its node affinity. Only if that turns out to be impossible, it will just
> -pick any free pCPU.
> -
> -This is, therefore, something more flexible than CPU affinity, as a domain
> -can still run everywhere, it just prefers some nodes rather than others.
> -Locality of access is less guaranteed than in the pinning case, but that
> -comes along with better chances to exploit all the host resources (e.g.,
> -the pCPUs).
> -
> -In fact, if all the pCPUs in a domain's node affinity are busy, it is
> -possible for the domain to run outside of there, but it is very likely that
> -slower execution (due to remote memory accesses) is still better than no
> -execution at all, as it would happen with pinning. For this reason, NUMA
> -aware scheduling has the potential of bringing substantial performances
> -benefits, although this will depend on the workload.
> +If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
> +itself always tries to run the domain's vCPUs on one of the nodes in
> +its node-affinity. Only if that turns out to be impossible will it just
> +pick any free pCPU. Locality of access is less guaranteed than in the
> +pinning case, but that comes along with better chances to exploit all
> +the host resources (e.g., the pCPUs).
> +
> +Starting from Xen 4.4, credit1 supports two forms of affinity: hard and

Just noticed: you need to s/4.4/4.5/g throughout this whole hunk.

Other than that, the Reviewed-by stands.

 -George
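
P.S. For anyone skimming only the quoted doc hunk: the "computed based
on the vCPUs' scheduling affinity" behaviour it mentions amounts to
something like the self-contained sketch below. The two-node topology,
the bitmask representation and all the names are invented for the
example; this is not the actual domain_update_node_affinity() code.

#include <stdint.h>
#include <stdio.h>

#define NR_NODES 2

int main(void)
{
    /* Hypothetical topology: pCPUs 0-3 on node 0, pCPUs 4-7 on node 1. */
    const uint64_t node_to_pcpus[NR_NODES] = { 0x0fULL, 0xf0ULL };

    /* A two-vCPU domain with both vCPUs pinned to pCPUs 4-5. */
    const uint64_t vcpu_affinity[] = { 0x30ULL, 0x30ULL };
    const unsigned int nr_vcpus =
        sizeof(vcpu_affinity) / sizeof(vcpu_affinity[0]);

    uint64_t dom_pcpus = 0;
    unsigned int node_affinity = 0, v, n;

    /* Take the union of all the vCPUs' affinity masks... */
    for ( v = 0; v < nr_vcpus; v++ )
        dom_pcpus |= vcpu_affinity[v];

    /* ...and include every node whose pCPUs intersect that union. */
    for ( n = 0; n < NR_NODES; n++ )
        if ( dom_pcpus & node_to_pcpus[n] )
            node_affinity |= 1u << n;

    /* Prints 0x2: only node 1 ends up in the derived node-affinity. */
    printf("derived node-affinity mask: %#x\n", node_affinity);
    return 0;
}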