From: George Dunlap
Subject: Re: [PATCH v6 2/9] xen: sched: introduce soft-affinity and use it instead d->node-affinity
Date: Mon, 2 Jun 2014 15:40:53 +0100
Message-ID: <538C8CF5.8060107@eu.citrix.com>
References: <1401237770-7003-1-git-send-email-dario.faggioli@citrix.com> <1401237770-7003-3-git-send-email-dario.faggioli@citrix.com>
In-Reply-To: <1401237770-7003-3-git-send-email-dario.faggioli@citrix.com>
To: Dario Faggioli, xen-devel
Cc: Andrew Cooper, Keir Fraser, Ian Jackson, Ian Campbell, Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On 05/28/2014 01:42 AM, Dario Faggioli wrote:
> Before this change, each vcpu had its own vcpu-affinity
> (in v->cpu_affinity), representing the set of pcpus where
> the vcpu is allowed to run. Since NUMA-aware scheduling was
> introduced, the (credit1 only, for now) scheduler also
> tries as much as it can to run all the vcpus of a domain
> on one of the nodes that constitute the domain's
> node-affinity.
>
> The idea here is making the mechanism more general by:
> * allowing for this 'preference' for some pcpus/nodes to be
>   expressed on a per-vcpu basis, rather than for the domain
>   as a whole. That is to say, each vcpu should have its own
>   set of preferred pcpus/nodes, rather than it being the
>   very same for all the vcpus of the domain;
> * generalizing the idea of 'preferred pcpus' beyond NUMA
>   awareness and support. That is to say, independently of
>   whether it is (mostly) useful on NUMA systems or not, it
>   should be possible to specify, for each vcpu, a set of pcpus
>   where it prefers to run (in addition to, and possibly
>   unrelated to, the set of pcpus where it is allowed to run).
>
> We will be calling this set of *preferred* pcpus the vcpu's
> soft affinity. This change introduces it, and starts using it
> for scheduling, replacing the indirect use of the domain's NUMA
> node-affinity. This is more general, as soft affinity does not
> have to be related to NUMA. Nevertheless, it allows achieving the
> same results as NUMA-aware scheduling, just by making soft affinity
> equal to the domain's node affinity, for all the vCPUs (e.g.,
> from the toolstack).
>
> This also means renaming most of the NUMA-aware scheduling related
> functions, in credit1, to something more generic, hinting at
> the concept of soft affinity rather than directly at NUMA awareness.
>
> As a side effect, this simplifies the code quite a bit. In fact,
> prior to this change, we needed to cache the translation of
> d->node_affinity (which is a nodemask_t) to a cpumask_t, since that
> is what scheduling decisions require (we used to keep it in
> node_affinity_cpumask). This, and all the complicated logic
> required to keep it updated, is no longer necessary.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is being updated too, to match
> the new architecture.
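
(Aside for anyone following the thread who has not read the credit1
changes yet: the mechanism described above boils down to a two-step
pCPU selection, roughly like the self-contained sketch below. Plain
64-bit masks stand in for cpumask_t and the function name is made up;
this is an illustration, not the actual credit1 code.)

#include <stdint.h>
#include <stdio.h>

/*
 * Pick a pCPU for a vCPU: prefer an idle pCPU in the intersection of
 * hard and soft affinity, and fall back to any idle pCPU in the hard
 * affinity alone.  Returns -1 if no idle pCPU is usable at all.
 */
static int pick_pcpu(uint64_t hard, uint64_t soft, uint64_t idle)
{
    uint64_t soft_step = hard & soft & idle;  /* soft-affinity step */
    uint64_t hard_step = hard & idle;         /* hard-affinity fallback */
    uint64_t candidates = soft_step ? soft_step : hard_step;

    if ( !candidates )
        return -1;
    return __builtin_ctzll(candidates);       /* lowest set bit = pCPU id */
}

int main(void)
{
    /*
     * A vCPU allowed on pCPUs 0-7 and preferring pCPUs 4-7 (say, the
     * pCPUs of its NUMA node), with only pCPUs 1 and 2 idle: the soft
     * step finds nothing, so the hard step picks pCPU 1.
     */
    printf("picked pCPU %d\n", pick_pcpu(0xffULL, 0xf0ULL, 0x06ULL));
    return 0;
}

The point is that the hard mask is always the constraint, while the
soft mask only ever narrows the search within it.
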
>
> Signed-off-by: Dario Faggioli
> Reviewed-by: George Dunlap
> ---
> Changes from v2:
>  * this patch folds patches 6 ("xen: sched: make space for
>    cpu_soft_affinity") and 10 ("xen: sched: use soft-affinity
>    instead of domain's node-affinity"), as suggested during
>    review. 'Reviewed-by' from George is there since both patch
>    6 and 10 had it, and I didn't do anything other than squashing
>    them.
>
> Changes from v1:
>  * in v1, "7/12 xen: numa-sched: use per-vcpu node-affinity for
>    actual scheduling" was doing something very similar to this
>    patch.
> ---
>  docs/misc/xl-numa-placement.markdown | 148 ++++++++++++++++++++------------
>  xen/common/domain.c                  |   5 +-
>  xen/common/keyhandler.c              |   2 +
>  xen/common/sched_credit.c            | 153 +++++++++++++---------------------
>  xen/common/schedule.c                |   3 +
>  xen/include/xen/sched.h              |   3 +
>  6 files changed, 168 insertions(+), 146 deletions(-)
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> index caa3fec..b1ed361 100644
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually
>  defined as a set of processor cores (typically a physical CPU package) and
>  the memory directly attached to the set of cores.
>
> -The Xen hypervisor deals with NUMA machines by assigning to each domain
> -a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated. Also, even if the node affinity of a domain
> -is allowed to change on-line, it is very important to "place" the domain
> -correctly when it is fist created, as the most of its memory is allocated
> -at that time and can not (for now) be moved easily.
> -
>  NUMA awareness becomes very important as soon as many domains start
>  running memory-intensive workloads on a shared host. In fact, the cost
>  of accessing non node-local memory locations is very high, and the
> @@ -27,14 +20,37 @@ performance degradation is likely to be noticeable.
>  For more information, have a look at the [Xen NUMA Introduction][numa_intro]
>  page on the Wiki.
>
> +## Xen and NUMA machines: the concept of _node-affinity_ ##
> +
> +The Xen hypervisor deals with NUMA machines through the concept of
> +_node-affinity_. The node-affinity of a domain is the set of NUMA nodes
> +of the host where the memory for the domain is being allocated (mostly,
> +at domain creation time). This is, at least in principle, different from
> +and unrelated to the vCPU (hard and soft, see below) scheduling affinity,
> +which instead is the set of pCPUs where the vCPU is allowed (or prefers)
> +to run.
> +
> +Of course, despite the fact that they belong to and affect different
> +subsystems, the domain node-affinity and the vCPUs' affinity are not
> +completely independent.
> +In fact, if the domain node-affinity is not explicitly specified by the
> +user, via the proper libxl calls or xl config item, it will be computed
> +based on the vCPUs' scheduling affinity.
> +
> +Notice that, even if the node affinity of a domain may change on-line,
> +it is very important to "place" the domain correctly when it is first
> +created, as most of its memory is allocated at that time and cannot
> +(for now) be moved easily.
> +
>  ### Placing via pinning and cpupools ###
>
> -The simplest way of placing a domain on a NUMA node is statically pinning
> -the domain's vCPUs to the pCPUs of the node. This goes under the name of
> -CPU affinity and can be set through the "cpus=" option in the config file
> -(more about this below). Another option is to pool together the pCPUs
> -spanning the node and put the domain in such a cpupool with the "pool="
> -config option (as documented in our [Wiki][cpupools_howto]).
> +The simplest way of placing a domain on a NUMA node is setting the hard
> +scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
> +also goes under the name of vCPU pinning, and can be done through the
> +"cpus=" option in the config file (more about this below). Another option
> +is to pool together the pCPUs spanning the node and put the domain in
> +such a _cpupool_ with the "pool=" config option (as documented in our
> +[Wiki][cpupools_howto]).
>
>  In both the above cases, the domain will not be able to execute outside
>  the specified set of pCPUs for any reasons, even if all those pCPUs are
> @@ -45,24 +61,45 @@ may come at he cost of some load imbalances.
>
>  ### NUMA aware scheduling ###
>
> -If the credit scheduler is in use, the concept of node affinity defined
> -above does not only apply to memory. In fact, starting from Xen 4.3, the
> -scheduler always tries to run the domain's vCPUs on one of the nodes in
> -its node affinity. Only if that turns out to be impossible, it will just
> -pick any free pCPU.
> -
> -This is, therefore, something more flexible than CPU affinity, as a domain
> -can still run everywhere, it just prefers some nodes rather than others.
> -Locality of access is less guaranteed than in the pinning case, but that
> -comes along with better chances to exploit all the host resources (e.g.,
> -the pCPUs).
> -
> -In fact, if all the pCPUs in a domain's node affinity are busy, it is
> -possible for the domain to run outside of there, but it is very likely that
> -slower execution (due to remote memory accesses) is still better than no
> -execution at all, as it would happen with pinning. For this reason, NUMA
> -aware scheduling has the potential of bringing substantial performances
> -benefits, although this will depend on the workload.
> +If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
> +itself always tries to run the domain's vCPUs on one of the nodes in
> +its node-affinity. Only if that turns out to be impossible will it just
> +pick any free pCPU. Locality of access is less guaranteed than in the
> +pinning case, but that comes along with better chances to exploit all
> +the host resources (e.g., the pCPUs).
> +
> +Starting from Xen 4.4, credit1 supports two forms of affinity: hard and

Just noticed: you need to s/4.4/4.5/g throughout this whole hunk.

Other than that, the Reviewed-by stands.

 -George
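
P.S. For anyone skimming only the quoted doc hunk: the "computed based
on the vCPUs' scheduling affinity" behaviour it mentions amounts to
something like the self-contained sketch below. The two-node topology,
the bitmask representation and all the names are invented for the
example; this is not the actual domain_update_node_affinity() code.

#include <stdint.h>
#include <stdio.h>

#define NR_NODES 2

int main(void)
{
    /* Hypothetical topology: pCPUs 0-3 on node 0, pCPUs 4-7 on node 1. */
    const uint64_t node_to_pcpus[NR_NODES] = { 0x0fULL, 0xf0ULL };

    /* A two-vCPU domain with both vCPUs pinned to pCPUs 4-5. */
    const uint64_t vcpu_affinity[] = { 0x30ULL, 0x30ULL };
    const unsigned int nr_vcpus =
        sizeof(vcpu_affinity) / sizeof(vcpu_affinity[0]);

    uint64_t dom_pcpus = 0;
    unsigned int node_affinity = 0, v, n;

    /* Take the union of all the vCPUs' affinity masks... */
    for ( v = 0; v < nr_vcpus; v++ )
        dom_pcpus |= vcpu_affinity[v];

    /* ...and include every node whose pCPUs intersect that union. */
    for ( n = 0; n < NR_NODES; n++ )
        if ( dom_pcpus & node_to_pcpus[n] )
            node_affinity |= 1u << n;

    /* Prints 0x2: only node 1 ends up in the derived node-affinity. */
    printf("derived node-affinity mask: %#x\n", node_affinity);
    return 0;
}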