Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity

All of lore.kernel.org
 help / color / mirror / Atom feed

From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
	Dan Magenheimer <dan.magenheimer@oracle.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Anil Madhavapeddy <anil@recoil.org>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>,
	Daniel De Graaf <dgdegra@tycho.nsa.gov>,
	Matt Wilson <msw@amazon.com>
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 20:21:45 +0000	[thread overview]
Message-ID: <50D37359.9080001@eu.citrix.com> (raw)
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>

On 19/12/12 19:07, Dario Faggioli wrote:
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }
>
>   static int
> @@ -493,85 +599,98 @@ static int
>       cpumask_t idlers;
>       cpumask_t *online;
>       struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>       int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> -            ? vc->processor
> -            : cpumask_cycle(vc->processor, &cpus);
> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> +        cpumask_and(&cpus, &cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu, &idlers);
> -    cpumask_and(&cpus, &cpus, &idlers);
> -    cpumask_clear_cpu(cpu, &cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> +                ? vc->processor
> +                : cpumask_cycle(vc->processor, &cpus);
> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu, &idlers);
> +        cpumask_and(&cpus, &cpus, &idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> +            cpu = cpumask_cycle(cpu, &cpus);

This seems to be an addition to the algorithm -- particularly hidden in 
this kind of "indent a big section that's almost exactly the same", I 
think this at least needs to be called out in the changelog message, 
perhaps put in a separate patch.

Can you comment on to why you think it's necessary?  Was there a 
particular problem you were seeing?

> +        cpumask_clear_cpu(cpu, &cpus);
>
> -        nxt = cpumask_cycle(cpu, &cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu, &cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu > weight_nxt :
> +                 weight_cpu * migrate_factor < weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> +                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> +            }
>           }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu > weight_nxt :
> -             weight_cpu * migrate_factor < weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> +            break;

Right -- OK, I think everything looks good here, except the "return -1 
from csched_balance_cpumask" thing.  I think it would be better if we 
explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped 
the NODE step if that's the case.

Also -- and sorry to have to ask this kind of thing, but after sorting 
through the placement algorithm my head hurts -- under what 
circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this 
point?  It seems like the only possibility would be if:
( (vc->processor was not in the original &cpus [1])
   || !IS_RUNQ_IDLE(vc->processor) )
&& (there are no idlers in the original &cpus)

Which I suppose probably matches the time when we want to move on from 
looking at NODE affinity and look for CPU affinity.

[1] This could happen either if the vcpu/node affinity has changed, or 
if we're currently running outside our node affinity and we're doing the 
NODE step.

OK -- I think I've convinced myself that this is OK as well (apart from 
the hidden check).  I'll come back to look at your response to the load 
balancing thing tomorrow.

  -George

next prev parent reply	other threads:[~2012-12-20 20:21 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
2012-12-20  9:17   ` Jan Beulich
2012-12-20  9:35     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
2012-12-20  9:18   ` Jan Beulich
2012-12-20  9:55     ` Dario Faggioli
2012-12-20 14:33     ` George Dunlap
2012-12-20 14:52       ` Jan Beulich
2012-12-20 15:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
2012-12-20  6:44   ` Juergen Gross
2012-12-20  8:16     ` Dario Faggioli
2012-12-20  8:25       ` Juergen Gross
2012-12-20  8:33         ` Dario Faggioli
2012-12-20  8:39           ` Juergen Gross
2012-12-20  8:58             ` Dario Faggioli
2012-12-20 15:28             ` George Dunlap
2012-12-20 16:00               ` Dario Faggioli
2012-12-20  9:22           ` Jan Beulich
2012-12-20 15:56   ` George Dunlap
2012-12-20 17:12     ` Dario Faggioli
2012-12-20 16:48   ` George Dunlap
2012-12-20 18:18     ` Dario Faggioli
2012-12-21 14:29       ` George Dunlap
2012-12-21 16:07         ` Dario Faggioli
2012-12-20 20:21   ` George Dunlap [this message]
2012-12-21  0:18     ` Dario Faggioli
2012-12-21 14:56       ` George Dunlap
2012-12-21 16:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
2012-12-21 15:17   ` George Dunlap
2012-12-21 16:17     ` Dario Faggioli
2013-01-03 16:05     ` Daniel De Graaf
2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
2012-12-21 15:19   ` George Dunlap
2012-12-21 16:27     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
2012-12-21 15:30   ` George Dunlap
2012-12-21 16:18     ` Dario Faggioli
2012-12-21 17:02       ` Ian Jackson
2012-12-21 17:09         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
2012-12-20  8:41   ` Ian Campbell
2012-12-20  9:24     ` Dario Faggioli
2012-12-21 16:00   ` George Dunlap
2012-12-21 16:23     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
2012-12-21 16:22   ` George Dunlap
2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
2012-12-21 16:34   ` George Dunlap
2012-12-21 16:54     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2013-01-11 12:19 ` Ian Campbell
2013-01-11 13:57   ` Dario Faggioli
2013-01-11 14:09     ` Ian Campbell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50D37359.9080001@eu.citrix.com \
    --to=george.dunlap@eu.citrix.com \
    --cc=Andrew.Cooper3@citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=Marcus.Granado@eu.citrix.com \
    --cc=anil@recoil.org \
    --cc=dan.magenheimer@oracle.com \
    --cc=dario.faggioli@citrix.com \
    --cc=dgdegra@tycho.nsa.gov \
    --cc=juergen.gross@ts.fujitsu.com \
    --cc=msw@amazon.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.