Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`

All of lore.kernel.org
 help / color / mirror / Atom feed

From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`
Date: Fri, 27 Apr 2012 15:45:12 +0100	[thread overview]
Message-ID: <4F9AB0F8.10102@eu.citrix.com> (raw)
In-Reply-To: <1f4b55806de9e7109ff6.1334150274@Solace>

On 11/04/12 14:17, Dario Faggioli wrote:
> Basic idea is: cpu-affinity tells where a vcpu can run,
> while node-affinity tells where (in terms of on what NUMA nodes,
> but that of course imply a set of CPUs) all the vcpus of the
> domain should/prefer to run... Problems starts when you have
> both of them speaking at the same time!
>
> Respecting vcpu-affinity should remain the primary concern, thus
> what we do here is add some two-step logic here and there in the
> scheduler (i.e., where vcpu-affinity related decisions are being
> undertaken). During the first step, we check the `&&` of vcpu and
> node affinity masks, aiming at giving some precedence to the CPUs
> that are both suitable and preferrable for the domain. However,
> if that fails in finding a valid CPU, the node-affinity is just
> ignored and, in the second step, we just fallback to vcpu-affinity,
> as the original behaviour was.
>
> Both the introduced overhead and the benefits this provides has
> to be investigated and compared against each other, possibly in
> varying conditions and running different workloads.
>
> Finally, although still at the RFC level, efforts have been made
> to write the code in a flexible enough fashion so that, if we
> ever want to introduce further balancing steps, it shouldn't be
> too much of a pain.
>
> TODO: * Better investigate and test interactions with cpupools.
>        * Test, verify and benchmark. Then test, verify and benchmark
>          again, and again and again. What I know as of know is that
>          it does not explode that easily, but whether it works properly
>          and, more important, brings any benefit, this has to be
>          proven (and yes, I'm out running these tests and benchs already,
>          but do not esitate to manifest your will for helping with
>          that :-D).
>        * Add some counter/stats, e.g., to serve the purpose outlined
>          right above.
>
> XXX I'm benchmarking this just right now, and peeking at the results
>      they don't seem too bad. Numbers are mostly close to the case where
>      the VM is already pinned to a node when created. I'll post the
>      results as soon as I can, and I'll be happy to investigate any issue,
>      if you feel like the approach could be the right one.
Hey Dario,

Sorry for the long delay in reviewing this.

Overall I think the approach is good.  A few points:
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
> +#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY
> +
> +
This thing of defining "START_STEP" and "END_STEP" seems a bit fragile.  
I think it would be better to always start at 0, and go until 
CSCHED_BALANCE_MAX.
> +    /*
> +     * Let's cache domain's dom->node_affinity here as an
> +     * optimization for a couple of hot paths. In fact,
> +     * knowing  whether or not dom->node_affinity has changed
> +     * would allow us to avoid rebuilding node_affinity_cpumask
> +     * (below) duing node balancing and/or scheduling.
> +     */
> +    nodemask_t node_affinity_cache;
> +    /* Basing on what dom->node_affinity says,
> +     * on what CPUs would we like to run most? */
> +    cpumask_t node_affinity_cpumask;
I think the comments here need to be more clear.  The main points are:
* node_affinity_cpumask is the dom->node_affinity translated from a 
nodemask into a cpumask
* Because doing the nodemask -> cpumask translation may be expensive, 
node_affinity_cache stores the last translated value, so we can avoid 
doing the translation if nothing has changed.

> +#define csched_balance(__step) \
> +    for ( (__step) = CSCHED_BALANCE_START_STEP; \
> +          (__step)>= CSCHED_BALANCE_END_STEP; \
> +          (__step)-- )
I don't like this.  For one, it's fragile; if for some reason you 
switched NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop would 
go the wrong direction.

I don't think there's any reason not to just use "for(step=0; step < 
CSCHED_BALANCE_MAX; step++)" -- it's not ugly and it means you know 
exactly what's going on without having to go search for the macro.

> +
> +/*
> + * Sort-of conversion between node-affinity and vcpu-affinity for the domain,
> + * i.e., a cpumask containing all the cpus from all the set nodes in the
> + * node-affinity mask of the domain.
This needs to be clearer -- vcpu-affinity doesn't have anything to do 
with this function, and there's nothing "sort-of" about the conversion. :-)

I think you mean to say, "Create a cpumask from the node affinity mask."
> + *
> + * Notice that we completely forget about serializing this (both here and
> + * in the various sites where node_affinity_cpumask is used) against
> + * d->node_affinity_lock. That would be hard to do properly, as that lock
> + * is a non IRQ-safe one, and it would be nearly impossible to access it
> + * from the scheduling code. However, although this leaves some room for
> + * races with code paths that updates d->node_affinity, it all should still
> + * be fine, considering:
> + *  (1) d->node_affinity updates are going to be quite rare;
> + *  (2) this balancing logic is kind-of "best effort" anyway;
> + *  (3) given (1), any inconsistencies are likely to be fixed by the next
> + *      call to this same function without risking going into instability.
> + *
> + * XXX Validate (via testing/benchmarking) whether this is true, as it
> + *     likely sounds to be, or it causes unexpected issues.
Ack.
> +/* Put together the cpumask we are going to use for this balancing step */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    struct domain *d = vc->domain;
> +    struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +    /*
> +     * Only two possible steps exists for now: node and vcpu affinity.
> +     * If this domain does not have a node-affinity or is affine to all the
> +     * nodes, just return th vcpu-affinity mask (for *this* vcpu) and
> +     * stop the balancing process.
> +     */
> +    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
> +         step == CSCHED_BALANCE_CPU_AFFINITY )
> +    {
> +        cpumask_copy(mask, vc->cpu_affinity);
> +        return CSCHED_BALANCE_CPU_AFFINITY;
> +    }
Hmm -- this mechanism seems kind of fragile.  It seems like returning -1 
or something, and having the caller call "continue", would be easier for 
future coders (potentially you or I) to reason about.
> +    /*
> +     * XXX As of now (with only two possible steps, this is easy and readable
> +     *     enough. If more steps become necessary at some point in time, we
> +     *     can keep an array of cpumask_t in dom_data and return the proper
> +     *     element via step-indexing such an array. Of course, we can turn
> +     *     this like that even right now... Thoughts?
> +     */
Beware of early optimization. :-)  Just make sure to mark this down for 
something to look at for profiling.
>   static inline void
> +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
> +{
> +    CSCHED_STAT_CRANK(tickle_idlers_some);
> +    if ( opt_tickle_one_idle )
> +    {
> +        this_cpu(last_tickle_cpu) =
> +            cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
> +        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
> +    }
> +    else
> +        cpumask_or(mask, mask, idle_mask);
> +}
I don't see any reason to make this into a function -- it's only called 
once, and it's not that long.  Unless you're concerned about too many 
indentations making the lines too short?
> +            csched_balance(balance_step)
>               {
Again, I'd make this a for() loop.
>      sdom->dom = dom;
> +    /*
> +     *XXX This would be 'The Right Thing', but as it is still too
> +     *    early and d->node_affinity has not settled yet, maybe we
> +     *    can just init the two masks with something like all-nodes
> +     *    and all-cpus and rely on the first balancing call for
> +     *    having them updated?
> +     */
> +    csched_build_balance_cpumask(sdom);
We might as well do what you've got here, unless it's likely to produce 
garbage.  This isn't exactly a hot path. :-)

Other than that, looks good -- Thanks!

  -George

next prev parent reply	other threads:[~2012-04-27 14:45 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-11 13:17 [PATCH 00 of 10 [RFC]] Automatically place guest on host's NUMA nodes with xl Dario Faggioli
2012-04-11 13:17 ` [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map Dario Faggioli
2012-04-11 16:08   ` George Dunlap
2012-04-11 16:31     ` Dario Faggioli
2012-04-11 16:41       ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 02 of 10 [RFC]] libxl: Generalize libxl_cpumap to just libxl_map Dario Faggioli
2012-04-11 13:17 ` [PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap Dario Faggioli
2012-04-11 16:38   ` George Dunlap
2012-04-11 16:57     ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 04 of 10 [RFC]] libxl: Introduce libxl_get_numainfo() calling xc_numainfo() Dario Faggioli
2012-04-11 13:17 ` [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file Dario Faggioli
2012-04-12 10:24   ` George Dunlap
2012-04-12 10:48     ` David Vrabel
2012-04-12 22:25       ` Dario Faggioli
2012-04-12 11:32     ` Formatting of emails which are comments on patches Ian Jackson
2012-04-12 11:42       ` George Dunlap
2012-04-12 22:21     ` [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file Dario Faggioli
2012-04-11 13:17 ` [PATCH 06 of 10 [RFC]] xl: Allow user to set or change node affinity on-line Dario Faggioli
2012-04-12 10:29   ` George Dunlap
2012-04-12 21:57     ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity` Dario Faggioli
2012-04-12 23:06   ` Dario Faggioli
2012-04-27 14:45   ` George Dunlap [this message]
2012-05-02 15:13     ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes Dario Faggioli
2012-05-01 15:45   ` George Dunlap
2012-05-02 16:30     ` Dario Faggioli
2012-05-03  1:03       ` Dario Faggioli
2012-05-03  8:10         ` Ian Campbell
2012-05-03 10:16         ` George Dunlap
2012-05-03 13:41       ` George Dunlap
2012-05-03 14:58         ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 09 of 10 [RFC]] xl: Introduce Best and Worst Fit guest placement algorithms Dario Faggioli
2012-04-16 10:29   ` Dario Faggioli
2012-04-11 13:17 ` [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation Dario Faggioli
2012-04-12  9:11   ` Ian Campbell
2012-04-12 10:32     ` Dario Faggioli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F9AB0F8.10102@eu.citrix.com \
    --to=george.dunlap@eu.citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=Stefano.Stabellini@eu.citrix.com \
    --cc=andre.przywara@amd.com \
    --cc=juergen.gross@ts.fujitsu.com \
    --cc=raistlin@linux.it \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.