From: George Dunlap
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 20:21:45 +0000
Message-ID: <50D37359.9080001@eu.citrix.com>
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
To: Dario Faggioli
Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf, Matt Wilson
List-Id: xen-devel@lists.xenproject.org

On 19/12/12 19:07, Dario Faggioli wrote:
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
> +}
>
>  static int
> @@ -493,85 +599,98 @@ static int
>      cpumask_t idlers;
>      cpumask_t *online;
>      struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>      int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> -             ? vc->processor
> -             : cpumask_cycle(vc->processor, &cpus);
> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> +        cpumask_and(&cpus, &cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu, &idlers);
> -    cpumask_and(&cpus, &cpus, &idlers);
> -    cpumask_clear_cpu(cpu, &cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> +                 ? vc->processor
> +                 : cpumask_cycle(vc->processor, &cpus);
> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu, &idlers);
> +        cpumask_and(&cpus, &cpus, &idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> +            cpu = cpumask_cycle(cpu, &cpus);

This seems to be an addition to the algorithm, and it's particularly well hidden in this kind of "indent a big section that's almost exactly the same" change. I think it at least needs to be called out in the changelog message, and perhaps put in a separate patch. Can you comment on why you think it's necessary? Was there a particular problem you were seeing?
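Just to make sure I'm reading those two lines correctly, here is my paraphrase of what I think they do at that point (my reading, not anything taken from the patch or its description):

    /*
     * (My reading, not the patch's wording.)  By this point &cpus has
     * been narrowed down to the *idle* CPUs compatible with this
     * balance step's affinity mask.  So if the current candidate
     * (vc->processor, or the first suitable CPU after it) is busy
     * while a compatible idler exists, re-bias the search onto one of
     * the idlers before the weight-comparison loop below runs.
     */
    if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
        cpu = cpumask_cycle(cpu, &cpus);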
> +        cpumask_clear_cpu(cpu, &cpus);
>
> -        nxt = cpumask_cycle(cpu, &cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu, &cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                                                          cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                                                          nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu > weight_nxt :
> +                 weight_cpu * migrate_factor < weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> +                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> +            }
> +        }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu > weight_nxt :
> -             weight_cpu * migrate_factor < weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> +            break;

Right -- OK, I think everything looks good here, except the "return -1 from csched_balance_cpumask" thing. I think it would be better if we explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped the NODE step if that's the case.
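Something like this is what I have in mind -- just a sketch against what I can see of this patch, and I'm guessing at the name of the balance-step constant and at the mask hanging off the csched_dom, so adjust as needed:

    for_each_csched_balance_step( balance_step )
    {
        /*
         * (Names here are guesses from the patch, not gospel.)  A full
         * node-affinity mask can't rule any CPU out, so the NODE step
         * would do exactly what the CPU step is going to do anyway.
         * Skip it explicitly, rather than having
         * csched_balance_cpumask() flag it with a special return value.
         */
        if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
             && cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
            continue;

        csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);

        /* ... the rest of the loop as in this patch, minus the ret plumbing ... */
    }

That way the "can we stop now" check at the bottom of the loop only has to ask whether cpu ended up idle, instead of relying on a magic return value.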
Also -- and sorry to have to ask this kind of thing, but after sorting through the placement algorithm my head hurts -- under what circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this point? It seems like the only possibility would be if:

  ( (vc->processor was not in the original &cpus [1])
    || !IS_RUNQ_IDLE(vc->processor) )
  && (there are no idlers in the original &cpus)

Which I suppose probably matches the point at which we want to move on from looking at NODE affinity and start looking at CPU affinity.

[1] This could happen either if the vcpu/node affinity has changed, or if we're currently running outside our node affinity and we're doing the NODE step.

OK -- I think I've convinced myself that this is OK as well (apart from the hidden check). I'll come back to look at your response to the load balancing thing tomorrow.

 -George