From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross <juergen.gross@ts.fujitsu.com>
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the
 scheduler know about node-affinity
Date: Thu, 20 Dec 2012 07:44:46 +0100
Message-ID: <50D2B3DE.70206@ts.fujitsu.com>
References: <patchbomb.1355944036@Solace>
	<06d2f322a6319d8ba212.1355944039@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>, Dan Magenheimer <dan.magenheimer@oracle.com>, Ian Campbell <Ian.Campbell@citrix.com>, Anil Madhavapeddy <anil@recoil.org>, George Dunlap <george.dunlap@eu.citrix.com>, Andrew Cooper <Andrew.Cooper3@citrix.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, xen-devel@lists.xen.org, Jan Beulich <JBeulich@suse.com>, Daniel De Graaf <dgdegra@tycho.nsa.gov>, Matt Wilson <msw@amazon.com>
List-Id: xen-devel@lists.xenproject.org

On 19.12.2012 20:07, Dario Faggioli wrote:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should or, better, prefer to run. While respecting vcpu-affinity
> remains mandatory, node-affinity is not that strict: it only expresses
> a preference, although it is almost always true that honouring it will
> bring significant performance benefits (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two-step logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
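
For readers following along, the two-step fallback described above can be
sketched in isolation. This is only an illustration under simplifying
assumptions: masks are 64-bit words instead of cpumask_t, and step_mask(),
first_cpu() and pick_pcpu() are hypothetical names that do not appear in
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Balance steps, tried from the highest value down, as in the patch. */
enum {
    STEP_CPU_AFFINITY  = 0,
    STEP_NODE_AFFINITY = 1,
    STEP_LAST          = STEP_NODE_AFFINITY
};

/* Mask to use for a given step (hypothetical helper). The node-affinity
 * step is always filtered through the vcpu-affinity, which stays mandatory. */
static uint64_t step_mask(int step, uint64_t cpu_aff, uint64_t node_aff)
{
    return step == STEP_NODE_AFFINITY ? (cpu_aff & node_aff) : cpu_aff;
}

/* Index of the lowest set bit; the mask must not be empty. */
static int first_cpu(uint64_t mask)
{
    int cpu = 0;

    assert(mask != 0);
    while ( !(mask & 1) )
    {
        mask >>= 1;
        cpu++;
    }
    return cpu;
}

/* Pick an idle pcpu: node-affinity first, then plain vcpu-affinity.
 * Returns -1 if no idle pcpu is allowed at all. */
static int pick_pcpu(uint64_t idlers, uint64_t cpu_aff, uint64_t node_aff)
{
    int step;

    for ( step = STEP_LAST; step >= 0; step-- )
    {
        uint64_t candidates = idlers & step_mask(step, cpu_aff, node_aff);

        if ( candidates )
            return first_cpu(candidates);
    }
    return -1;
}
```

With idlers 0xC, vcpu-affinity 0xF and node-affinity 0x3, the first step
finds nothing and the second step falls back to pcpu 2.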
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Changes from v1:
>   * CPU mask variables moved off the stack, as requested during
>     review. As per the comments in the code, having them in the private
>     (per-scheduler instance) struct could have been enough, but it would be
>     racy (again, see comments). For that reason, use a global bunch of
>     them (via per_cpu());

Wouldn't it be better to put the mask in the scheduler private per-pcpu area?
This could be applied to several other instances of cpu masks on the stack,
too.
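
A minimal standalone sketch of the idea, under stated assumptions:
struct sketch_pcpu, scratch_mask(), mask_and() and SKETCH_NR_CPUS are made
up for illustration, and mask_t is a stand-in for cpumask_t. In Xen the
field would presumably sit in struct csched_pcpu and be reached through
CSCHED_PCPU(cpu), but that part is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4                    /* assumption, sketch only */

typedef struct { uint64_t bits; } mask_t;   /* stand-in for cpumask_t */

/* Stand-in for the scheduler's private per-pcpu data. */
struct sketch_pcpu {
    int    idle_bias;
    mask_t balance_mask;    /* scratch space owned by this pcpu */
};

static struct sketch_pcpu pcpu_data[SKETCH_NR_CPUS];

/* Return the scratch mask belonging to the given pcpu. */
static mask_t *scratch_mask(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    return &pcpu_data[cpu].balance_mask;
}

/* dst = a & b, on the simplified mask type. */
static void mask_and(mask_t *dst, uint64_t a, uint64_t b)
{
    dst->bits = a & b;
}
```

Each pcpu only ever touches its own scratch mask, so no lock beyond the
one already held for that pcpu's runqueue would be needed, and the mask
is allocated and freed together with the rest of the per-pcpu data.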

>   * George suggested a different load balancing logic during v1's review. I
>     think he was right, so I changed the old implementation in a way
>     that resembles exactly that. I rewrote most of this patch to introduce
>     a more sensible and effective node-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,33 @@
>
>
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
> +
> +/*
> + * When building for high number of CPUs, cpumask_var_t
> + * variables on stack are better avoided. However, we need them,
> + * in order to be able to consider both vcpu and node affinity.
> + * We also don't want to xmalloc()/xfree() them, as that would
> + * happen in critical code paths. Therefore, let's (pre)allocate
> + * some scratch space for them.
> + *
> + * Having one mask for each instance of the scheduler seems
> + * enough, and that would suggest putting it within `struct
> + * csched_private' below. However, we don't always hold the
> + * private scheduler lock when the mask itself would need to
> + * be used, leaving room for races. For that reason, we define
> + * and use a cpumask_t for each CPU. As preemption is not an
> + * issue here (we're holding the runqueue spin-lock!), that is
> + * both enough and safe.
> + */
> +DEFINE_PER_CPU(cpumask_t, csched_balance_mask);
> +#define scratch_balance_mask (this_cpu(csched_balance_mask))
> +
> +/*
>    * Boot parameters
>    */
>   static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
> @@ -159,6 +186,9 @@ struct csched_dom {
>       struct list_head active_vcpu;
>       struct list_head active_sdom_elem;
>       struct domain *dom;
> +    /* cpumask translated from the domain's node-affinity.
> +     * Basically, the CPUs we prefer to be scheduled on. */
> +    cpumask_var_t node_affinity_cpumask;
>       uint16_t active_vcpu_count;
>       uint16_t weight;
>       uint16_t cap;
> @@ -239,6 +269,42 @@ static inline void
>       list_del_init(&svc->runq_elem);
>   }
>
> +#define for_each_csched_balance_step(__step) \
> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
> +
> +/*
> + * Each csched-balance step has to use its own cpumask. This function
> + * determines which one, given the step, and copies it in mask. Notice
> + * that, in case of node-affinity balancing step, it also filters out from
> + * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
> + * as we do not want to end up running a vcpu where it would like, but
> + * is not allowed to!
> + *
> + * As an optimization, if a domain does not have any node-affinity at all
> + * (namely, its node affinity is automatically computed), not only the
> + * computed mask will reflect its vcpu-affinity, but we also return -1 to
> + * let the caller know that he can skip the step or quit the loop (if he
> + * wants).
> + */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
> +    {
> +        struct domain *d = vc->domain;
> +        struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
> +
> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +            return -1;
> +    }
> +    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
> +        cpumask_copy(mask, vc->cpu_affinity);
> +
> +    return 0;
> +}
> +
>   static void burn_credits(struct csched_vcpu *svc, s_time_t now)
>   {
>       s_time_t delta;
> @@ -266,67 +332,94 @@ static inline void
>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>       cpumask_t mask, idle_mask;
> -    int idlers_empty;
> +    int balance_step, idlers_empty;
>
>       ASSERT(cur);
> -    cpumask_clear(&mask);
> -
>       idlers_empty = cpumask_empty(prv->idlers);
>
>       /*
> -     * If the pcpu is idle, or there are no idlers and the new
> -     * vcpu is a higher priority than the old vcpu, run it here.
> -     *
> -     * If there are idle cpus, first try to find one suitable to run
> -     * new, so we can avoid preempting cur.  If we cannot find a
> -     * suitable idler on which to run new, run it here, but try to
> -     * find a suitable idler on which to run cur instead.
> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
> +     * no node-affinity at all is present, scratch_balance_mask reflects
> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
> +     * loop after only one step.
>        */
> -    if ( cur->pri == CSCHED_PRI_IDLE
> -         || (idlers_empty&&  new->pri>  cur->pri) )
> +    for_each_csched_balance_step( balance_step )
>       {
> -        if ( cur->pri != CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -        cpumask_set_cpu(cpu,&mask);
> -    }
> -    else if ( !idlers_empty )
> -    {
> -        /* Check whether or not there are idlers that can run new */
> -        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> +        int ret, new_idlers_empty;
> +
> +        cpumask_clear(&mask);
>
>           /*
> -         * If there are no suitable idlers for new, and it's higher
> -         * priority than cur, ask the scheduler to migrate cur away.
> -         * We have to act like this (instead of just waking some of
> -         * the idlers suitable for cur) because cur is running.
> +         * If the pcpu is idle, or there are no idlers and the new
> +         * vcpu is a higher priority than the old vcpu, run it here.
>            *
> -         * If there are suitable idlers for new, no matter priorities,
> -         * leave cur alone (as it is running and is, likely, cache-hot)
> -         * and wake some of them (which is waking up and so is, likely,
> -         * cache cold anyway).
> +         * If there are idle cpus, first try to find one suitable to run
> +         * new, so we can avoid preempting cur.  If we cannot find a
> +         * suitable idler on which to run new, run it here, but try to
> +         * find a suitable idler on which to run cur instead.
>            */
> -        if ( cpumask_empty(&idle_mask)&&  new->pri>  cur->pri )
> +        if ( cur->pri == CSCHED_PRI_IDLE
> +             || (idlers_empty&&  new->pri>  cur->pri) )
>           {
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> -            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> -            SCHED_STAT_CRANK(migrate_kicked_away);
> -            set_bit(_VPF_migrating,&cur->vcpu->pause_flags);
> +            if ( cur->pri != CSCHED_PRI_IDLE )
> +                SCHED_STAT_CRANK(tickle_idlers_none);
>               cpumask_set_cpu(cpu,&mask);
>           }
> -        else if ( !cpumask_empty(&idle_mask) )
> +        else if ( !idlers_empty )
>           {
> -            /* Which of the idlers suitable for new shall we wake up? */
> -            SCHED_STAT_CRANK(tickle_idlers_some);
> -            if ( opt_tickle_one_idle )
> +            /* Are there idlers suitable for new (for this balance step)? */
> +            ret = csched_balance_cpumask(new->vcpu, balance_step,
> +                                         &scratch_balance_mask);
> +            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
> +            new_idlers_empty = cpumask_empty(&idle_mask);
> +
> +            /*
> +             * Let's not be too harsh! If there aren't idlers suitable
> +             * for new in its node-affinity mask, make sure we check its
> +             * vcpu-affinity as well, before taking final decisions.
> +             */
> +            if ( new_idlers_empty
> +                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
> +                continue;
> +
> +            /*
> +             * If there are no suitable idlers for new, and it's higher
> +             * priority than cur, ask the scheduler to migrate cur away.
> +             * We have to act like this (instead of just waking some of
> +             * the idlers suitable for cur) because cur is running.
> +             *
> +             * If there are suitable idlers for new, no matter priorities,
> +             * leave cur alone (as it is running and is, likely, cache-hot)
> +             * and wake some of them (which is waking up and so is, likely,
> +             * cache cold anyway).
> +             */
> +            if ( new_idlers_empty&&  new->pri>  cur->pri )
>               {
> -                this_cpu(last_tickle_cpu) =
> -                    cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask);
> -                cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask);
> +                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +                SCHED_STAT_CRANK(migrate_kicked_away);
> +                set_bit(_VPF_migrating,&cur->vcpu->pause_flags);
> +                cpumask_set_cpu(cpu,&mask);
>               }
> -            else
> -                cpumask_or(&mask,&mask,&idle_mask);
> +            else if ( !new_idlers_empty )
> +            {
> +                /* Which of the idlers suitable for new shall we wake up? */
> +                SCHED_STAT_CRANK(tickle_idlers_some);
> +                if ( opt_tickle_one_idle )
> +                {
> +                    this_cpu(last_tickle_cpu) =
> +                        cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask);
> +                    cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask);
> +                }
> +                else
> +                    cpumask_or(&mask,&mask,&idle_mask);
> +            }
>           }
> +
> +        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
> +        if ( !cpumask_empty(&mask) || ret )
> +            break;
>       }
>
>       if ( !cpumask_empty(&mask) )
> @@ -475,15 +568,28 @@ static inline int
>   }
>
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running&&
>              !__csched_vcpu_is_cache_hot(vc)&&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers)&&  cpumask_test_cpu(cpu, mask);
>   }
>
>   static int
> @@ -493,85 +599,98 @@ static int
>       cpumask_t idlers;
>       cpumask_t *online;
>       struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>       int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor,&cpus)
> -            ? vc->processor
> -            : cpumask_cycle(vc->processor,&cpus);
> -    ASSERT( !cpumask_empty(&cpus)&&  cpumask_test_cpu(cpu,&cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step,&cpus);
> +        cpumask_and(&cpus,&cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu,&idlers);
> -    cpumask_and(&cpus,&cpus,&idlers);
> -    cpumask_clear_cpu(cpu,&cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor,&cpus)
> +                ? vc->processor
> +                : cpumask_cycle(vc->processor,&cpus);
> +        ASSERT( !cpumask_empty(&cpus)&&  cpumask_test_cpu(cpu,&cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu,&idlers);
> +        cpumask_and(&cpus,&cpus,&idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus)&&  !cpumask_test_cpu(cpu,&cpus) )
> +            cpu = cpumask_cycle(cpu,&cpus);
> +        cpumask_clear_cpu(cpu,&cpus);
>
> -        nxt = cpumask_cycle(cpu,&cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu,&cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask,
> +                            cpu));
> +                cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask,
> +                            nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu>  weight_nxt :
> +                 weight_cpu * migrate_factor<  weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers,&cpus,&nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers);
> +                cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus,&cpus,&nxt_idlers);
> +            }
>           }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu>  weight_nxt :
> -             weight_cpu * migrate_factor<  weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers,&cpus,&nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers);
> -            cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus,&cpus,&nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu,&idlers) || ret )
> +            break;
>       }
>
>       if ( commit&&  spc )
> @@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu
>       if ( sdom == NULL )
>           return NULL;
>
> +    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
> +    {
> +        xfree(sdom);
> +        return NULL;
> +    }
> +    cpumask_setall(sdom->node_affinity_cpumask);
> +
>       /* Initialize credit and weight */
>       INIT_LIST_HEAD(&sdom->active_vcpu);
>       sdom->active_vcpu_count = 0;
> @@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler *
>   static void
>   csched_free_domdata(const struct scheduler *ops, void *data)
>   {
> +    struct csched_dom *sdom = data;
> +
> +    free_cpumask_var(sdom->node_affinity_cpumask);
>       xfree(data);
>   }
>
> @@ -1240,9 +1369,10 @@ csched_tick(void *_cpu)
>   }
>
>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>               if ( speer->pri<= pri )
>                   break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>               vc = speer->vcpu;
>               BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * Retrieve the correct mask for this balance_step or, if we're
> +             * dealing with node-affinity and the vcpu has no node affinity
> +             * at all, just skip this vcpu. That is needed if we want to
> +             * check if we have any node-affine work to steal first (wrt
> +             * any vcpu-affine work).
> +             */
> +            if ( csched_balance_cpumask(vc, balance_step,
> +                                        &scratch_balance_mask) )
> +                continue;
> +
> +            if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask)
> +                 && __csched_vcpu_should_migrate(cpu, &scratch_balance_mask,
> +                                                 prv->idlers) )
>               {
>                   /* We got a candidate. Grab it! */
>                   TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu,&workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well be expected to be cheaper than across-nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers,&workers,&node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu,&workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu,&workers);
> -        cpumask_clear_cpu(peer_cpu,&workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 * could cause a deadlock if the peer CPU is also load
> +                 * balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu,&workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu,&workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>       }
>
>    out:
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -41,6 +41,8 @@
>    * int last_node(mask)			Number highest set bit, or MAX_NUMNODES
>    * int first_unset_node(mask)		First node not set in mask, or
>    *					MAX_NUMNODES.
> + * int cycle_node(node, mask)		Next node cycling from 'node', or
> + *					MAX_NUMNODES
>    *
>    * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
>    * NODE_MASK_ALL			Initializer - all bits set
> @@ -254,6 +256,16 @@ static inline int __first_unset_node(con
>   			find_first_zero_bit(maskp->bits, MAX_NUMNODES));
>   }
>
> +#define cycle_node(n, src) __cycle_node((n),&(src), MAX_NUMNODES)
> +static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
> +{
> +    int nxt = __next_node(n, maskp, nbits);
> +
> +    if (nxt == nbits)
> +        nxt = __first_node(maskp, nbits);
> +    return nxt;
> +}
> +
>   #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
>
>   #if MAX_NUMNODES<= BITS_PER_LONG
>
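
The wrap-around behaviour of the cycle_node() helper added to nodemask.h
above can be illustrated with a small standalone sketch. It assumes a
64-bit word in place of nodemask_t; next_set(), cycle() and
SKETCH_MAX_NODES are hypothetical names, not the Xen implementation.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 8      /* assumption, sketch only */

/* First set bit at index >= start, or SKETCH_MAX_NODES if there is none. */
static int next_set(uint64_t mask, int start)
{
    int n;

    for ( n = start; n < SKETCH_MAX_NODES; n++ )
        if ( mask & (UINT64_C(1) << n) )
            return n;
    return SKETCH_MAX_NODES;
}

/* Next set bit after n, wrapping around to the first set bit.
 * Only an empty mask yields SKETCH_MAX_NODES. */
static int cycle(int n, uint64_t mask)
{
    int nxt;

    assert(n >= 0 && n < SKETCH_MAX_NODES);
    nxt = next_set(mask, n + 1);
    if ( nxt == SKETCH_MAX_NODES )
        nxt = next_set(mask, 0);
    return nxt;
}
```

Starting from a node that is set in the mask and cycling repeatedly always
returns to that node, which is what the do/while over peer_node relies on.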


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html