From: George Dunlap
Subject: Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`
Date: Fri, 27 Apr 2012 15:45:12 +0100
Message-ID: <4F9AB0F8.10102@eu.citrix.com>
In-Reply-To: <1f4b55806de9e7109ff6.1334150274@Solace>
To: Dario Faggioli
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, Juergen Gross,
 Ian Jackson, "xen-devel@lists.xen.org", Jan Beulich

On 11/04/12 14:17, Dario Faggioli wrote:
> The basic idea is: cpu-affinity tells where a vcpu can run, while
> node-affinity tells where (in terms of on what NUMA nodes, but that of
> course implies a set of CPUs) all the vcpus of the domain
> should/prefer to run... Problems start when you have both of them
> speaking at the same time!
>
> Respecting vcpu-affinity should remain the primary concern, thus what
> we do here is add some two-step logic here and there in the scheduler
> (i.e., where vcpu-affinity related decisions are being undertaken).
> During the first step, we check the `&&` of the vcpu and node affinity
> masks, aiming at giving some precedence to the CPUs that are both
> suitable and preferable for the domain. However, if that fails to find
> a valid CPU, the node-affinity is just ignored and, in the second
> step, we fall back to vcpu-affinity, as per the original behaviour.
>
> Both the introduced overhead and the benefits this provides have to be
> investigated and compared against each other, possibly in varying
> conditions and running different workloads.
>
> Finally, although still at the RFC level, efforts have been made to
> write the code in a flexible enough fashion so that, if we ever want
> to introduce further balancing steps, it shouldn't be too much of a
> pain.
>
> TODO: * Better investigate and test interactions with cpupools.
>       * Test, verify and benchmark. Then test, verify and benchmark
>         again, and again and again. What I know as of now is that it
>         does not explode that easily, but whether it works properly
>         and, more importantly, brings any benefit, has to be proven
>         (and yes, I'm already out running these tests and benchmarks,
>         but do not hesitate to volunteer to help with that :-D).
>       * Add some counters/stats, e.g., to serve the purpose outlined
>         right above.
>
> XXX I'm benchmarking this right now, and peeking at the results they
>     don't seem too bad. Numbers are mostly close to the case where the
>     VM is already pinned to a node when created. I'll post the results
>     as soon as I can, and I'll be happy to investigate any issue, if
>     you feel like the approach could be the right one.

Hey Dario,

Sorry for the long delay in reviewing this.  Overall I think the
approach is good.  A few points:

>  /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
> +#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY
> +
> +

This thing of defining "START_STEP" and "END_STEP" seems a bit
fragile.  I think it would be better to always start at 0 and go up to
CSCHED_BALANCE_MAX.
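Something along these lines is what I have in mind (just a sketch; the
exact values and the CSCHED_BALANCE_MAX name are only a suggestion):

    /* Balancing steps, counted up from 0; node affinity is tried first. */
    #define CSCHED_BALANCE_NODE_AFFINITY  0
    #define CSCHED_BALANCE_CPU_AFFINITY   1
    /* One past the last valid step, so loops can just test against it. */
    #define CSCHED_BALANCE_MAX            2

    int balance_step;

    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        /* Build the cpumask for this step and try to place the vcpu. */
    }

That way, adding another balancing step later is just a matter of
inserting another #define and bumping CSCHED_BALANCE_MAX.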
> +    /*
> +     * Let's cache the domain's dom->node_affinity here as an
> +     * optimization for a couple of hot paths. In fact, knowing
> +     * whether or not dom->node_affinity has changed would allow
> +     * us to avoid rebuilding node_affinity_cpumask (below) during
> +     * node balancing and/or scheduling.
> +     */
> +    nodemask_t node_affinity_cache;
> +    /* Based on what dom->node_affinity says,
> +     * on what CPUs would we like to run most? */
> +    cpumask_t node_affinity_cpumask;

I think the comments here need to be clearer.  The main points are:
 * node_affinity_cpumask is dom->node_affinity translated from a
   nodemask into a cpumask
 * Because doing the nodemask -> cpumask translation may be expensive,
   node_affinity_cache stores the last translated value, so we can
   avoid redoing the translation if nothing has changed.

> +#define csched_balance(__step) \
> +    for ( (__step) = CSCHED_BALANCE_START_STEP; \
> +          (__step) >= CSCHED_BALANCE_END_STEP;  \
> +          (__step)-- )

I don't like this.  For one, it's fragile; if for some reason you
switched NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop
would go the wrong direction.  I don't think there's any reason not to
just use "for ( step = 0; step < CSCHED_BALANCE_MAX; step++ )" -- it's
not ugly, and it means you know exactly what's going on without having
to go search for the macro.

> +
> +/*
> + * Sort-of conversion between node-affinity and vcpu-affinity for the
> + * domain, i.e., a cpumask containing all the cpus from all the set
> + * nodes in the node-affinity mask of the domain.

This needs to be clearer -- vcpu-affinity doesn't have anything to do
with this function, and there's nothing "sort-of" about the
conversion. :-)  I think you mean to say, "Create a cpumask from the
node affinity mask."

> + *
> + * Notice that we completely forget about serializing this (both here
> + * and in the various sites where node_affinity_cpumask is used)
> + * against d->node_affinity_lock. That would be hard to do properly,
> + * as that lock is a non IRQ-safe one, and it would be nearly
> + * impossible to access it from the scheduling code. However, although
> + * this leaves some room for races with code paths that update
> + * d->node_affinity, it all should still be fine, considering:
> + *  (1) d->node_affinity updates are going to be quite rare;
> + *  (2) this balancing logic is kind-of "best effort" anyway;
> + *  (3) given (1), any inconsistencies are likely to be fixed by the
> + *      next call to this same function without risking going into
> + *      instability.
> + *
> + * XXX Validate (via testing/benchmarking) whether this is true, as it
> + *     likely sounds to be, or whether it causes unexpected issues.

Ack.

> +/* Put together the cpumask we are going to use for this balancing step */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    struct domain *d = vc->domain;
> +    struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +    /*
> +     * Only two possible steps exist for now: node and vcpu affinity.
> +     * If this domain does not have a node-affinity or is affine to
> +     * all the nodes, just return the vcpu-affinity mask (for *this*
> +     * vcpu) and stop the balancing process.
> +     */
> +    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
> +         step == CSCHED_BALANCE_CPU_AFFINITY )
> +    {
> +        cpumask_copy(mask, vc->cpu_affinity);
> +        return CSCHED_BALANCE_CPU_AFFINITY;
> +    }

Hmm -- this mechanism seems kind of fragile.  It seems like returning
-1 or something, and having the caller call "continue", would be
easier for future coders (potentially you or me) to reason about.
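Roughly what I'm picturing is something like this (just a sketch, not
even compile-tested; it assumes the 0-based step numbering from above,
and names like scratch_mask are only placeholders):

    /*
     * Fill 'mask' for this balancing step.  Returns 0 on success, or
     * -1 if the step does not apply to this domain (e.g., the
     * node-affinity step for a domain with no node-affinity).
     */
    static int
    csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
    {
        struct domain *d = vc->domain;
        struct csched_dom *sdom = CSCHED_DOM(d);

        if ( step == CSCHED_BALANCE_NODE_AFFINITY )
        {
            if ( !d->has_node_affinity || nodes_full(d->node_affinity) )
                return -1;
            /* CPUs that are both suitable and preferred for the domain. */
            cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
        }
        else
            cpumask_copy(mask, vc->cpu_affinity);

        return 0;
    }

and then at the call sites:

    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        if ( csched_balance_cpumask(vc, balance_step, &scratch_mask) )
            continue;
        /* ...do the placement / tickling with scratch_mask... */
    }

That keeps the "does this step apply?" decision inside
csched_balance_cpumask(), and the callers just skip the steps that
don't apply.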
> +    /*
> +     * XXX As of now (with only two possible steps) this is easy and
> +     *     readable enough. If more steps become necessary at some
> +     *     point in time, we can keep an array of cpumask_t in
> +     *     dom_data and return the proper element via step-indexing
> +     *     such an array. Of course, we could do it that way even
> +     *     right now... Thoughts?
> +     */

Beware of early optimization. :-)  Just make sure to mark this down as
something to look at when profiling.

>  static inline void
> +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
> +{
> +    CSCHED_STAT_CRANK(tickle_idlers_some);
> +    if ( opt_tickle_one_idle )
> +    {
> +        this_cpu(last_tickle_cpu) =
> +            cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
> +        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
> +    }
> +    else
> +        cpumask_or(mask, mask, idle_mask);
> +}

I don't see any reason to make this into a function -- it's only
called once, and it's not that long.  Unless you're concerned about
too many indentation levels making the lines too short?

> +    csched_balance(balance_step)
>      {

Again, I'd make this a for() loop.

>      sdom->dom = dom;
> +    /*
> +     * XXX This would be 'The Right Thing', but as it is still too
> +     *     early and d->node_affinity has not settled yet, maybe we
> +     *     can just init the two masks with something like all-nodes
> +     *     and all-cpus and rely on the first balancing call for
> +     *     having them updated?
> +     */
> +    csched_build_balance_cpumask(sdom);

We might as well do what you've got here, unless it's likely to
produce garbage.  This isn't exactly a hot path. :-)

Other than that, looks good -- Thanks!
 -George