From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
Ian Campbell <Ian.Campbell@citrix.com>,
Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
Juergen Gross <juergen.gross@ts.fujitsu.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`
Date: Fri, 27 Apr 2012 15:45:12 +0100
Message-ID: <4F9AB0F8.10102@eu.citrix.com>
In-Reply-To: <1f4b55806de9e7109ff6.1334150274@Solace>
On 11/04/12 14:17, Dario Faggioli wrote:
> Basic idea is: cpu-affinity tells where a vcpu can run,
> while node-affinity tells where (in terms of what NUMA nodes,
> which of course implies a set of CPUs) all the vcpus of the
> domain should/prefer to run... Problems start when you have
> both of them speaking at the same time!
>
> Respecting vcpu-affinity should remain the primary concern, thus
> what we do here is add some two-step logic here and there in the
> scheduler (i.e., where vcpu-affinity related decisions are being
> undertaken). During the first step, we check the `&&` of vcpu and
> node affinity masks, aiming at giving some precedence to the CPUs
> that are both suitable and preferable for the domain. However,
> if that fails in finding a valid CPU, the node-affinity is just
> ignored and, in the second step, we just fall back to vcpu-affinity,
> as the original behaviour was.
>
> Both the introduced overhead and the benefits this provides have
> to be investigated and compared against each other, possibly in
> varying conditions and running different workloads.
>
> Finally, although still at the RFC level, efforts have been made
> to write the code in a flexible enough fashion so that, if we
> ever want to introduce further balancing steps, it shouldn't be
> too much of a pain.
>
> TODO: * Better investigate and test interactions with cpupools.
> * Test, verify and benchmark. Then test, verify and benchmark
> again, and again and again. What I know as of now is that
> it does not explode that easily, but whether it works properly
> and, more importantly, brings any benefit, still has to be
> proven (and yes, I'm out running these tests and benchmarks already,
> but do not hesitate to speak up if you want to help with
> that :-D).
> * Add some counter/stats, e.g., to serve the purpose outlined
> right above.
>
> XXX I'm benchmarking this just right now, and peeking at the results
> they don't seem too bad. Numbers are mostly close to the case where
> the VM is already pinned to a node when created. I'll post the
> results as soon as I can, and I'll be happy to investigate any issue,
> if you feel like the approach could be the right one.
Hey Dario,
Sorry for the long delay in reviewing this.
Overall I think the approach is good. A few points:
> /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY 1
> +#define CSCHED_BALANCE_CPU_AFFINITY 0
> +#define CSCHED_BALANCE_START_STEP CSCHED_BALANCE_NODE_AFFINITY
> +#define CSCHED_BALANCE_END_STEP CSCHED_BALANCE_CPU_AFFINITY
> +
> +
This thing of defining "START_STEP" and "END_STEP" seems a bit fragile.
I think it would be better to always start at 0, and go until
CSCHED_BALANCE_MAX.
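Something along these lines, for instance (just a sketch; I'm assuming you
still want the node-affinity step to come first, hence the numbering):

#define CSCHED_BALANCE_NODE_AFFINITY 0
#define CSCHED_BALANCE_CPU_AFFINITY  1
/* Not a real step: only the bound for iterating over all of them. */
#define CSCHED_BALANCE_MAX           2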
> + /*
> + * Let's cache domain's dom->node_affinity here as an
> + * optimization for a couple of hot paths. In fact,
> + * knowing whether or not dom->node_affinity has changed
> + * would allow us to avoid rebuilding node_affinity_cpumask
> + * (below) during node balancing and/or scheduling.
> + */
> + nodemask_t node_affinity_cache;
> + /* Based on what dom->node_affinity says,
> + * on what CPUs would we like to run most? */
> + cpumask_t node_affinity_cpumask;
I think the comments here need to be more clear. The main points are:
* node_affinity_cpumask is the dom->node_affinity translated from a
nodemask into a cpumask
* Because doing the nodemask -> cpumask translation may be expensive,
node_affinity_cache stores the last translated value, so we can avoid
doing the translation if nothing has changed.
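For instance, something along these lines (the wording is just a suggestion,
reusing the field names from your patch):

    /*
     * Copy of dom->node_affinity as it was when we last built
     * node_affinity_cpumask (below).  Translating a nodemask into a
     * cpumask may be expensive, so we only redo the translation when
     * dom->node_affinity no longer matches this cached value.
     */
    nodemask_t node_affinity_cache;
    /* dom->node_affinity translated into the corresponding set of CPUs. */
    cpumask_t node_affinity_cpumask;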
> +#define csched_balance(__step) \
> + for ( (__step) = CSCHED_BALANCE_START_STEP; \
> + (__step)>= CSCHED_BALANCE_END_STEP; \
> + (__step)-- )
I don't like this. For one, it's fragile; if for some reason you
switched NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop would
go the wrong direction.
I don't think there's any reason not to just use "for(step=0; step <
CSCHED_BALANCE_MAX; step++)" -- it's not ugly and it means you know
exactly what's going on without having to go search for the macro.
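I.e., spelled out (with the step constants renumbered as in the sketch
above):

    int balance_step;

    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        /* ... one balancing pass per step ... */
    }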
> +
> +/*
> + * Sort-of conversion between node-affinity and vcpu-affinity for the domain,
> + * i.e., a cpumask containing all the cpus from all the set nodes in the
> + * node-affinity mask of the domain.
This needs to be clearer -- vcpu-affinity doesn't have anything to do
with this function, and there's nothing "sort-of" about the conversion. :-)
I think you mean to say, "Create a cpumask from the node affinity mask."
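FWIW, the core of it really is just something like this (untested sketch; I
haven't double-checked the exact name or signature of the node-to-cpumask
helper):

static void
csched_build_balance_cpumask(struct csched_dom *sdom)
{
    const struct domain *d = sdom->dom;
    int node;

    /* Accumulate the CPUs of every node set in d->node_affinity. */
    cpumask_clear(&sdom->node_affinity_cpumask);
    for_each_node_mask ( node, d->node_affinity )
        cpumask_or(&sdom->node_affinity_cpumask,
                   &sdom->node_affinity_cpumask,
                   &node_to_cpumask(node));
}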
> + *
> + * Notice that we completely forget about serializing this (both here and
> + * in the various sites where node_affinity_cpumask is used) against
> + * d->node_affinity_lock. That would be hard to do properly, as that lock
> + * is a non IRQ-safe one, and it would be nearly impossible to access it
> + * from the scheduling code. However, although this leaves some room for
> + * races with code paths that update d->node_affinity, it all should still
> + * be fine, considering:
> + * (1) d->node_affinity updates are going to be quite rare;
> + * (2) this balancing logic is kind-of "best effort" anyway;
> + * (3) given (1), any inconsistencies are likely to be fixed by the next
> + * call to this same function without risking going into instability.
> + *
> + * XXX Validate (via testing/benchmarking) whether this is true, as it
> + * likely sounds to be, or it causes unexpected issues.
Ack.
> +/* Put together the cpumask we are going to use for this balancing step */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> + struct domain *d = vc->domain;
> + struct csched_dom *sdom = CSCHED_DOM(d);
> +
> + /*
> + * Only two possible steps exist for now: node and vcpu affinity.
> + * If this domain does not have a node-affinity or is affine to all the
> + * nodes, just return the vcpu-affinity mask (for *this* vcpu) and
> + * stop the balancing process.
> + */
> + if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
> + step == CSCHED_BALANCE_CPU_AFFINITY )
> + {
> + cpumask_copy(mask, vc->cpu_affinity);
> + return CSCHED_BALANCE_CPU_AFFINITY;
> + }
Hmm -- this mechanism seems kind of fragile. It seems like returning -1
or something, and having the caller call "continue", would be easier for
future coders (potentially you or me) to reason about.
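Something like this, maybe (again just a sketch, reusing the names from your
patch and not even compile-tested):

static int
csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
{
    struct domain *d = vc->domain;
    struct csched_dom *sdom = CSCHED_DOM(d);

    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
    {
        /* No (useful) node affinity: tell the caller to skip this step. */
        if ( !d->has_node_affinity || nodes_full(d->node_affinity) )
            return -1;
        cpumask_and(mask, &sdom->node_affinity_cpumask, vc->cpu_affinity);
    }
    else
        cpumask_copy(mask, vc->cpu_affinity);

    return 0;
}

with the callers then doing something like:

    if ( csched_balance_cpumask(vc, balance_step, &balance_mask) )
        continue;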
> + /*
> + * XXX As of now (with only two possible steps) this is easy and readable
> + * enough. If more steps become necessary at some point in time, we
> + * can keep an array of cpumask_t in dom_data and return the proper
> + * element via step-indexing such an array. Of course, we could switch to
> + * that approach even right now... Thoughts?
> + */
Beware of early optimization. :-) Just make sure to mark this down as
something to look at when profiling.
> static inline void
> +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
> +{
> + CSCHED_STAT_CRANK(tickle_idlers_some);
> + if ( opt_tickle_one_idle )
> + {
> + this_cpu(last_tickle_cpu) =
> + cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
> + cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
> + }
> + else
> + cpumask_or(mask, mask, idle_mask);
> +}
I don't see any reason to make this into a function -- it's only called
once, and it's not that long. Unless you're concerned about too many
indentations making the lines too short?
> + csched_balance(balance_step)
> {
Again, I'd make this a for() loop.
> sdom->dom = dom;
> + /*
> + *XXX This would be 'The Right Thing', but as it is still too
> + * early and d->node_affinity has not settled yet, maybe we
> + * can just init the two masks with something like all-nodes
> + * and all-cpus and rely on the first balancing call for
> + * having them updated?
> + */
> + csched_build_balance_cpumask(sdom);
We might as well do what you've got here, unless it's likely to produce
garbage. This isn't exactly a hot path. :-)
Other than that, looks good -- Thanks!
-George