From: George Dunlap
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 16:48:13 +0000
Message-ID: <50D3414D.8080901@eu.citrix.com>
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
References: <06d2f322a6319d8ba212.1355944039@Solace>
To: Dario Faggioli
Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
    Andrew Cooper, Juergen Gross, Ian Jackson, "xen-devel@lists.xen.org",
    Jan Beulich, Daniel De Graaf, Matt Wilson
List-Id: xen-devel@lists.xenproject.org

On 19/12/12 19:07, Dario Faggioli wrote:
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running &&
>              !__csched_vcpu_is_cache_hot(vc) &&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }

I don't get what this function is for.  The only time you call it is in
csched_runq_steal(), immediately after calling
__csched_vcpu_is_migrateable().  But is_migrateable() has already checked
cpumask_test_cpu(cpu, mask).  So why do we need to check it again?

We could just replace this with cpumask_test_cpu(cpu, prv->idlers).

But that clause is going to be either true or false for every single
iteration of all the loops, including the loops in csched_load_balance().
Wouldn't it make more sense to check it once in csched_load_balance(),
rather than doing all those nested loops?

And in any case, looking at the caller of csched_load_balance(), it
explicitly says to steal work if the next thing on the runqueue of cpu
has a priority of TS_OVER.  That was chosen for a reason -- if you want
to change that, you should change it there at the top (and make a
justification for doing so), not deeply nested in a function like this.

Or am I completely missing something?
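Just so we're talking about the same thing, the "check it once" variant I
have in mind would be something like the below.  It's only a sketch, not
even compile-tested, and I'm assuming the "nothing to steal" exit path of
csched_load_balance() can be reached via an "out"-style label -- adjust to
whatever the function actually has there:

     /* In csched_load_balance(), once, before scanning the peer CPUs: */
     if ( !cpumask_test_cpu(cpu, prv->idlers) )
     {
         /*
          * Some other (idle) CPU was tickled and is the one that should
          * come and pick the work up; pulling vcpus onto this CPU would
          * only make matters worse, so don't bother scanning the peers.
          */
         goto out;
     }

With that in place, __csched_vcpu_should_migrate() could go away entirely,
and csched_runq_steal() would only need its usual
__csched_vcpu_is_migrateable() check (now on the per-step mask).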
>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>           if ( speer->pri <= pri )
>               break;
>
> -        /* Is this VCPU is runnable on our PCPU? */
>           vc = speer->vcpu;
>           BUG_ON( is_idle_vcpu(vc) );
>
> -        if (__csched_vcpu_is_migrateable(vc, cpu))
> +        /*
> +         * Retrieve the correct mask for this balance_step or, if we're
> +         * dealing with node-affinity and the vcpu has no node affinity
> +         * at all, just skip this vcpu. That is needed if we want to
> +         * check if we have any node-affine work to steal first (wrt
> +         * any vcpu-affine work).
> +         */
> +        if ( csched_balance_cpumask(vc, balance_step,
> +                                    &scratch_balance_mask) )
> +            continue;

Again, I think for clarity the best thing to do here is:

     if ( balance_step == NODE
          && cpumask_full(speer->sdom->node_affinity_cpumask) )
         continue;

     csched_balance_cpumask();

     /* Etc. */
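Spelling that out a bit more (again, only a sketch; "NODE" is just
shorthand for whatever the node-affinity balance-step constant ends up
being called), the body of the loop would read something like:

         vc = speer->vcpu;
         BUG_ON( is_idle_vcpu(vc) );

         /*
          * In the node-affinity step, a vcpu whose node-affinity spans
          * all nodes has nothing node-specific to offer; it will be
          * considered again in the vcpu-affinity step anyway.
          */
         if ( balance_step == NODE
              && cpumask_full(speer->sdom->node_affinity_cpumask) )
             continue;

         csched_balance_cpumask(vc, balance_step, &scratch_balance_mask);

         if ( !__csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask) )
             continue;

         /* We got a candidate: grab and return it, as the code already does. */

That way the "no node-affinity at all" special case is visible right where
it matters, rather than being buried inside csched_balance_cpumask().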
> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu, &workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well expected to be cheaper than across-nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu, &workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -        cpumask_clear_cpu(peer_cpu, &workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 *       could cause a deadlock if the peer CPU is also load
> +                 *       balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>   }

These changes all look right.  But then, I'm a bit tired, so I'll give it
another once-over tomorrow. :-)

[To be continued]